California Housing Price Prediction Using Machine Learning: A Comparative Study Using Feature Engineering and Ensemble Methods
DOI:
https://doi.org/10.59994/ajbtme.2026.3.29
Keywords:
California Housing Dataset, Ensemble Learning, Feature Engineering, Housing Price Prediction, Hyperparameter Optimization, Machine Learning
Abstract
This study aimed to develop an accurate housing price prediction model for California by comparing multiple machine learning algorithms and evaluating the impact of feature engineering and ensemble techniques on predictive performance. The study used the California Housing Dataset, comprising 20,433 observations after data cleaning and preprocessing. A five-stage methodology was implemented: evaluation of seven baseline regression models, generation of engineered features, feature selection using the F-statistic method, hyperparameter optimization of the best-performing models through GridSearchCV, and construction of Voting and Stacking ensemble models. The findings revealed that linear models achieved limited performance because of the nonlinear relationships among variables, whereas tree-based and ensemble methods demonstrated superior predictive capabilities. The Stacking Ensemble model achieved the highest performance, with an R² of 0.8431, an RMSE of $46,317, and an MAE of $30,150. The results also confirmed that engineered features, particularly rooms per household, played a significant role in enhancing prediction accuracy. The scientific contribution of this study lies in proposing an integrated framework that combines feature engineering, feature selection, hyperparameter optimization, and advanced ensemble learning within a unified comparative environment. This approach surpassed the widely recognized benchmark by 3.21 percentage points in terms of R², while highlighting the importance of household-level feature normalization in housing price prediction.
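To make the household-level normalization and the three reported metrics concrete, the following is a minimal sketch and not the study's implementation. It assumes the column names `total_rooms` and `households` used in common distributions of the California Housing data; the helper names `add_rooms_per_household` and `regression_metrics` are hypothetical, and all numbers are made up for illustration.

```python
import math


def add_rooms_per_household(row):
    """Return a copy of `row` with the ratio feature rooms_per_household added.

    Assumes `row` has 'total_rooms' and 'households' keys (assumed column names).
    """
    out = dict(row)
    out["rooms_per_household"] = row["total_rooms"] / row["households"]
    return out


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from their standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Illustrative (made-up) block-group record and price predictions in dollars.
block = {"total_rooms": 880.0, "households": 126.0}
features = add_rooms_per_household(block)

y_true = [100_000.0, 200_000.0, 300_000.0, 400_000.0]
y_pred = [110_000.0, 190_000.0, 320_000.0, 380_000.0]
metrics = regression_metrics(y_true, y_pred)

print(features["rooms_per_household"])
print(metrics)
```

Dividing a block-level total by the number of households turns a count that scales with block size into a per-household quantity, which is the kind of normalization the abstract credits for improved accuracy. RMSE and MAE are in the same units as the target, so they read directly as dollar errors.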