Investigating the Usage of Random Forest Method on Next-Generation Sequencing Data to Predict MSH2 and MSH6 Associated Mutations
Random Forest Method to Predict MSH2 and MSH6 Associated Mutations
DOI: https://doi.org/10.54393/fbt.v5i1.131
Keywords: Colorectal cancer (CRC), Random Forest, Machine Learning, MSH2, MSH6, DNA, Diagnosis, NGS, Mutation
Abstract
Colorectal cancer (CRC) is one of the most prevalent cancers and the second leading cause of cancer-related deaths globally. Germline mutations associated with CRC occur in the MSH2 and MSH6 genes, which encode components of the DNA mismatch repair (MMR) pathway. Objectives: To improve the prediction of CRC-related mutations by applying the Random Forest algorithm to next-generation sequencing (NGS) data for the MSH2 and MSH6 genes. Given the large volume of genetic information produced by NGS, a model that supports early diagnosis and individualized treatment of CRC is needed. Methods: Raw sequencing data for the MSH2 and MSH6 genes were downloaded from the NCBI Sequence Read Archive (SRA). Three datasets of 1000, 2000, and 3000 sequences were analyzed to assess genomic features, including ORF count, nucleotide content, AT/GC ratio, G-quadruplex signal, and mutation rates, and to understand their correlation with colorectal cancer. The data were then split into a training set (80%) and a test set (20%) for model training and testing in Python, using the Biopython package for mutation analysis and feature extraction. The model was evaluated using accuracy, a confusion matrix, and a classification report. Results: The Random Forest model achieved accuracies of 96.25%, 98.37%, and 99.5% for the datasets of 1000, 2000, and 3000 sequences, respectively. The confusion matrix showed that the model was very accurate in identifying true negatives, especially in the largest dataset. Conclusions: The study applied the Random Forest algorithm to predict CRC using NGS data on MSH2 and MSH6 gene mutations. The model shows promising potential for CRC research.
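The workflow summarized in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract names Python and Biopython, whereas this sketch uses only the standard library for feature extraction and assumes scikit-learn for the Random Forest step. The helper names (`count_orfs`, `extract_features`, `train_and_evaluate`), the G-quadruplex pattern, the minimum ORF length, and all hyperparameters are hypothetical choices. The mutation-rate feature is omitted because it would need a reference sequence that the abstract does not specify.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
# Assumed G-quadruplex pattern: four runs of >=3 G separated by loops of 1-7 nt.
G4_MOTIF = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def count_orfs(seq, min_codons=10):
    """Count forward-strand ORFs (ATG ... stop) spanning at least min_codons codons."""
    count = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    count += 1
                start = None
    return count


def extract_features(seq):
    """Compute simple per-sequence features of the kind listed in the abstract."""
    seq = seq.upper()
    n = len(seq)
    a, c, g, t = (seq.count(base) for base in "ACGT")
    gc = g + c
    return {
        "length": n,
        "gc_content": gc / n if n else 0.0,
        "at_gc_ratio": (a + t) / gc if gc else 0.0,
        "orf_count": count_orfs(seq),
        "g4_count": len(G4_MOTIF.findall(seq)),
    }


def train_and_evaluate(feature_rows, labels, seed=0):
    """80/20 split, Random Forest fit, and the three metrics named in the abstract."""
    # scikit-learn (an assumption) is imported lazily so the feature helpers
    # above can be used without it installed.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split

    names = sorted(feature_rows[0])
    X = [[row[name] for name in names] for row in feature_rows]
    # Stratified splitting is an assumption; the abstract only states 80%/20%.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }
```

In use, one would call `extract_features` on each sequence, collect the resulting dictionaries into a list alongside a label per sequence, and pass both to `train_and_evaluate`. How sequences were labeled is not described in the abstract, so the labels are left to the caller.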
License
Copyright (c) 2025 Futuristic Biotechnology

This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open-access journal, and all published articles and items are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For comments, contact editor@fbtjournal.com.