Analyzing risk factors and handling imbalanced data for predicting stroke risk using machine learning

Adiwijaya Adiwijaya; Nur Ghaniaviyanto Ramadhan

doi:10.26555/ijain.v11i1.1678


(1)* Adiwijaya Adiwijaya (School of Computing, Telkom University, Indonesia)
(2) Nur Ghaniaviyanto Ramadhan (School of Computing, Telkom University, Indonesia)
* corresponding author

Abstract

Stroke is a serious medical condition caused by disrupted blood flow to the brain, and it requires an immediate response. The principal risk factors that increase the likelihood of stroke include pre-existing conditions such as Diabetes Mellitus (DM), hypertension, and high cholesterol. Effective preventive measures are crucial for minimizing stroke risk, and applying predictive methods to clinical examination data offers an innovative approach. This study uses three years (2019-2021) of clinical examination records, known as the general checkup (GCU) dataset, to predict an individual's stroke risk for the following year. The study also addresses preprocessing of the GCU dataset: missing values are replaced with the statistical mean, categorical values are label-encoded, feature correlation is analyzed using entropy values, and class imbalance is handled with the Adaptive Synthetic (ADASYN) technique. Several machine learning models are then compared on their predictive performance. In the experiments, the Random Forest model performed best, with 98.7% accuracy and a 63.9% F1-score. This research highlights the importance of preemptive measures against stroke through predictive techniques applied to clinical data, with Random Forest proving the most effective of the compared models at forecasting stroke probability.
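The gap between the reported accuracy and F1-score is typical of imbalanced data, and a small worked example may help show why. The sketch below is illustrative only and is not the authors' code: the column names, toy values, and helper functions (impute_mean, label_encode, accuracy_and_f1) are hypothetical. It covers mean imputation, label encoding, and the two metrics using only the Python standard library. The ADASYN and Random Forest steps are omitted here; in practice they would usually come from libraries such as imbalanced-learn and scikit-learn.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def label_encode(column):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[v] for v in column], mapping


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical toy columns, not taken from the GCU dataset.
glucose = [90.0, None, 110.0, 100.0]
sex = ["F", "M", "M", "F"]

glucose_filled = impute_mean(glucose)         # the None becomes 100.0
sex_encoded, sex_mapping = label_encode(sex)  # {"F": 0, "M": 1}

# Hypothetical imbalanced test set: 98 non-stroke cases, 2 stroke cases.
# The imagined model raises one false alarm and finds one of the two strokes.
y_true = [0] * 98 + [1, 1]
y_pred = [0] * 97 + [1] + [1, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")  # accuracy=0.98, f1=0.50
```

In this toy case 98 of 100 predictions are correct, yet only one of the two positive cases is found and one alarm is false, so the F1-score for the minority class is 0.50. The same effect explains how a model can reach very high accuracy while its F1-score stays much lower.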

Keywords

General checkup data; Machine learning; Stroke prediction; ADASYN; Random forest

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
