The Impact of Advanced Preprocessing Techniques on Machine Learning Models for Income Prediction

Authors

  • Mutaz Rasmi Abu Sara, Faculty of Engineering and Information Technology, Palestine Ahliya University (Palestine)
  • Andres Emaya, Palestine Ahliya University (Palestine)

DOI:

https://doi.org/10.59994/ajbtme.2026.3.39

Keywords:

Data Preprocessing, Machine Learning Models, Adult-Income Dataset, Feature Engineering, Histogram-Based Gradient Boosting, Binary Classification

Abstract

Data preprocessing plays a fundamental role in improving the reliability and predictive performance of machine learning models, particularly on real-world tabular datasets that contain missing values, outliers, redundant features, skewed distributions, and heterogeneous data types. This study investigates the impact of advanced preprocessing techniques on income prediction using the Adult Income dataset. A comprehensive preprocessing pipeline was developed by integrating missing value imputation, variance-based feature selection, IQR-based outlier treatment, Yeo–Johnson transformation, standard scaling, one-hot encoding, and Singular Value Decomposition (SVD) within a ColumnTransformer-based workflow designed to prevent data leakage. Four machine learning models (Histogram-Based Gradient Boosting (HistGB), Random Forest, Logistic Regression, and Linear Support Vector Classifier) were trained and evaluated using stratified k-fold cross-validation and an 80/20 train-test split. Performance was assessed using Accuracy, Precision, Recall, F1-score, ROC-AUC, and Log-loss. The results demonstrate that the proposed preprocessing pipeline consistently improved the performance of all models, with Histogram-Based Gradient Boosting achieving the highest test accuracy of 86.8% and a ROC-AUC of 92.1%, indicating strong predictive capability and good generalization with minimal overfitting.

The originality of this study lies in the development of a unified and reproducible preprocessing framework that systematically integrates multiple advanced preprocessing techniques and applies them consistently across different machine learning models, enabling a fair comparative evaluation. Unlike previous studies that primarily emphasize algorithm selection, this research demonstrates that a carefully designed preprocessing pipeline can substantially enhance predictive performance and produce competitive results without relying on complex ensemble architectures or extensive model tuning.

References

Becker, B., & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository.

Chakrabarty, N., & Biswas, S. (2018, October). A statistical approach to adult census income level prediction. In 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN) (pp. 207-212). IEEE.

Islam, M. A., Nag, A., Roy, N., Dey, A. R., Fahim, S. F. A., & Ghosh, A. (2023, November). An investigation into the prediction of annual income levels through the utilization of demographic features employing the modified UCI adult dataset. In 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 1080-1086). IEEE.

Jo, K. (2024). Income prediction using machine learning techniques [Master’s thesis, University of California, Los Angeles]. eScholarship. https://escholarship.org/uc/item/6d01c9v7

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Thapa, S. (2023). Adult income prediction using various ML algorithms. Available at SSRN 4325813.

Yeo, I. K., & Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959.

Published

2026-05-31

How to Cite

Abu Sara, M. R., & Emaya, A. (2026). The Impact of Advanced Preprocessing Techniques on Machine Learning Models for Income Prediction. Ahliya Journal of Business Technology and MEAN Economies, 3(1), 39–46. https://doi.org/10.59994/ajbtme.2026.3.39

Section

Articles