Enhancing Breast Cancer Subtype Classification through GediNET: Integrating Disease-Disease Association Data with a Grouping-Scoring-Modeling Approach
DOI:
https://doi.org/10.59994/ajamts.2024.1.1
Abstract
Advances in sequencing technology and the growth of biological data repositories have enabled a more thorough understanding of the complex molecular aspects of diseases such as cancer. This paper evaluates GediNET, an integrative machine learning tool built on the Grouping-Scoring-Modeling (GSM) approach, for classifying molecular subtypes of breast cancer on the BRCA LumAB_Her2Basal dataset, comparing it against several feature selection approaches and machine learning classifiers. GediNET distinguishes itself from traditional feature selection methods by analyzing groups of genes to identify relevant disease-disease associations and potential biomarkers. The results show that GediNET outperforms traditional approaches in both accuracy and Area Under the Curve (AUC), which indicates that it is effective at capturing the genetic intricacies of breast cancer. The approach improves the identification of molecular subtypes and supports the development of targeted therapies and personalized medicine.
Keywords:
GediNET, Breast Cancer Subtype Classification, Grouping-Scoring-Modeling Approach, Machine Learning, Disease-Disease Associations, Biomarker Discovery, Genomic Data Analysis