California Housing Price Prediction Using Machine Learning: A Comparative Study Using Feature Engineering and Ensemble Methods

Mohammad Siwad; Mutaz Rasmi Abu Sara; Ahlam Awwad; Mohammad F. J. Klaib

doi:10.59994/ajbtme.2026.3.29

Authors

Mohammad Siwad, Palestine Ahliya University (Palestine)
Mutaz Rasmi Abu Sara, Faculty of Engineering and Information Technology, Palestine Ahliya University (Palestine)
Ahlam Awwad, Faculty of Engineering and Information Technology, Palestine Ahliya University (Palestine)
Mohammad F. J. Klaib, Intelligent Systems Engineering, Middle East University (Jordan)

DOI:

https://doi.org/10.59994/ajbtme.2026.3.29

Keywords:

California Housing Dataset, Ensemble Learning, Feature Engineering, Housing Price Prediction, Hyperparameter Optimization, Machine Learning

Abstract

This study aimed to develop an accurate housing price prediction model for California by comparing multiple machine learning algorithms and evaluating the impact of feature engineering and ensemble techniques on predictive performance. The study used the California Housing Dataset, which comprised 20,433 observations after data cleaning and preprocessing. A five-stage methodology was implemented: evaluation of seven baseline regression models, generation of engineered features, feature selection using the F-statistic method, hyperparameter optimization of the best-performing models with GridSearchCV, and construction of Voting and Stacking ensemble models. The findings showed that linear models achieved limited performance because of the nonlinear relationships among variables, whereas tree-based and ensemble methods predicted prices more accurately. The Stacking Ensemble model achieved the highest performance, with an R² of 0.8431, an RMSE of $46,317, and an MAE of $30,150. The results also confirmed that engineered features, particularly rooms per household, played a significant role in improving prediction accuracy. The scientific contribution of this study is an integrated framework that combines feature engineering, feature selection, hyperparameter optimization, and advanced ensemble learning within a unified comparative environment. This approach surpassed the widely recognized benchmark by 3.21 percentage points in R² and highlighted the importance of household-level feature normalization in housing price prediction.
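To make the household-level normalization and the three reported metrics concrete, the following is a minimal sketch in plain Python. It is not the authors' code. The column names `total_rooms` and `households`, the helper names, and the toy numbers are assumptions made for illustration and are not taken from the study.

```python
import math


def add_rooms_per_household(row):
    """Return a copy of a district record with a household-normalized feature.

    The keys "total_rooms" and "households" are assumed column names.
    """
    out = dict(row)
    out["rooms_per_household"] = row["total_rooms"] / row["households"]
    return out


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE and MAE for paired lists of actual and predicted prices."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Toy district record (hypothetical values).
district = {"total_rooms": 880.0, "households": 126.0}
features = add_rooms_per_household(district)

# Toy prices in dollars (hypothetical values, not results from the study).
y_true = [100000.0, 200000.0, 300000.0]
y_pred = [110000.0, 190000.0, 310000.0]
metrics = regression_metrics(y_true, y_pred)

print(features["rooms_per_household"])
print(metrics)
```

Because RMSE and MAE are expressed in the units of the target, they read directly as dollar errors, which is how the abstract reports them. R² is unitless and measures the share of price variance the model explains.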

