Predicting breast cancer recurrence using principal component analysis as feature extraction: an unbiased comparative analysis

(1) * Zuhaira Muhammad Zain (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(2) Mona Alshenaifi (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(3) Abeer Aljaloud (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(4) Tamadhur Albednah (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(5) Reham Alghanim (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(6) Alanoud Alqifari (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
(7) Amal Alqahtani (College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Saudi Arabia)
*corresponding author


Breast cancer recurrence is among the most significant fears faced by women. Advances in data mining technology make early prediction of recurrence possible, which can help relieve these fears. Medical data are typically complex, and narrowing them down to the most relevant inputs is challenging, but modern data mining techniques promise accurate predictions from high-dimensional data. This study compared the performance of three established data mining algorithms, Naïve Bayes (NB), k-nearest neighbor (KNN), and fast decision tree (REPTree), in predicting breast cancer recurrence, with principal component analysis (PCA) as the feature extraction algorithm. Each model was built both without and with PCA. The results showed that KNN produced better predictions without PCA (F-measure = 72.1%), whereas NB and REPTree improved when used with PCA (F-measure = 76.1% and 72.8%, respectively). These findings can help the healthcare industry assist physicians in predicting breast cancer recurrence more accurately.
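As an illustration of the metric used in the comparison above, the following minimal sketch computes an F-measure from confusion-matrix counts and reports which variant of a classifier (without or with PCA) scores higher. The function names and the counts are hypothetical and are not taken from the study. The sketch also assumes the plain binary F-measure (harmonic mean of precision and recall), since the abstract does not state which averaging scheme was used.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Binary F-measure: harmonic mean of precision and recall.

    tp, fp, fn are true-positive, false-positive and false-negative counts
    for the "recurrence" class. Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def better_variant(counts_without_pca, counts_with_pca):
    """Compare one classifier built without and with PCA (hypothetical helper).

    Each argument is a (tp, fp, fn) tuple. Returns a label naming the variant
    with the higher F-measure, followed by both scores.
    """
    f_without = f_measure(*counts_without_pca)
    f_with = f_measure(*counts_with_pca)
    label = "with PCA" if f_with > f_without else "without PCA"
    return label, f_without, f_with


# Hypothetical counts, for illustration only (not the study's data).
label, f_without, f_with = better_variant((30, 20, 10), (32, 15, 8))
print(f"{label}: {f_without:.3f} -> {f_with:.3f}")
```

In a real comparison the counts would come from evaluating each model on held-out data, once on the original attributes and once on the PCA-transformed attributes.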


Breast cancer recurrence; Data Mining; Feature Extraction; Machine Learning; Principal Component Analysis







This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan, UTM Big Data Centre - Universiti Teknologi Malaysia, and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
