Predicting breast cancer recurrence using principal component analysis as feature extraction: an unbiased comparative analysis

Article history: Received January 29, 2020; Revised October 29, 2020; Accepted November 6, 2020; Available online November 30, 2020

Breast cancer recurrence is among the most significant fears faced by women. Nevertheless, with modern innovations in data mining technology, early recurrence prediction can help relieve these fears. Although medical information is typically complicated, and narrowing it down to the most relevant input is challenging, new sophisticated data mining techniques promise accurate predictions from high-dimensional data. In this study, the performances of three established data mining algorithms, Naïve Bayes (NB), k-nearest neighbor (KNN), and fast decision tree (REPTree), were compared for predicting breast cancer recurrence, with principal component analysis (PCA) as the feature extraction algorithm. The comparison was conducted between models built in the absence and presence of PCA. The results showed that KNN produced better predictions without PCA (F-measure = 72.1%), whereas the other two techniques, NB and REPTree, improved when used with PCA (F-measure = 76.1% and 72.8%, respectively). This study can benefit the healthcare industry by assisting physicians in predicting breast cancer recurrence precisely.


Introduction
Breast cancer is the most prevalent cancer among women, affecting 2.1 million women per year. It also causes the largest number of cancer-related deaths among women. The World Health Organization (WHO) announced that an estimated 627,000 women died of breast cancer in 2018, around 15% of all cancer deaths among women. While breast cancer rates are higher among women in more developed countries, rates are rising in almost every region of the world [1].
Breast cancer recurrence is one of the biggest challenges a patient has to face and one of the issues that affect their living standards. Breast cancer recurrence refers to breast cancer returning in a woman whose earlier cancer was remediated. According to Pan et al. [2], even 20 years after a diagnosis, women with a type of breast cancer fueled by estrogen still face a substantial risk of the cancer returning or spreading. Predicting recurrence is challenging because recurrence data is rarely recorded in most breast cancer datasets. An accurate and timely prediction is essential because it helps physicians make decisions and supports more personalized patient therapy.
Knowledge discovery in databases (KDD) methods offer a promising way to examine data-driven problems of this kind. Data mining, an important subset of KDD, is an iterative process of searching for new, useful, and critical information in enormous amounts of data, also called high-dimensional data [3]. Many illnesses, including breast cancer [4]-[11], diabetes [4][12][13], oral cancer [12], and cardiovascular diseases [13]-[16], have been effectively diagnosed and predicted using data mining and machine learning procedures. These successful studies encourage the application of data mining to predicting breast tumor recurrence and form the foundation of this study.
In the last few years, different types of high-dimensional information have been generated by high-throughput technologies, especially those associated with the manifestation of disease and the control of tumor recurrence. Gaining insights from high-dimensional data is a challenge. High-dimensional data have to be transformed into low-dimensional data by applying dimensionality reduction techniques. Dimensionality reduction enables high-dimensional data to be classified, visualized, communicated, and stored.
The medical data dimensions contain a number of features, and every feature comprises various types of values. Data quality problems consist of missing or redundant data, outliers, noise, and biased or unrepresentative data entries [17]. To focus on data preparation, preprocessing stages should be used to increase the suitability of raw data for analysis. Additionally, medical data entries are usually complex and suffer from the challenge of high dimensionality. It is difficult to reduce the dataset used in the prediction manually, but a feature extraction technique can be used to solve this. Some popular feature extraction techniques include principal component analysis (PCA), independent component analysis, linear discriminant analysis, locally linear embedding, t-distributed stochastic neighbor embedding, and autoencoders [18]. Among them, PCA is widely used in breast cancer prediction [19]. Besides, it is the most appropriate approach that can be applied when there is a need to minimize the number of variables. However, it cannot specify which variable to keep in consideration. Also, PCA works best on datasets with three or higher dimensions of numeric variables.
Moreover, it aims to reduce feature dimension by capturing as much information as possible with high explained variance and minimizing information loss at the same time. With emerging techniques in data mining, the production of accurate predictions is promising. However, feature extraction alone is not sufficient to predict breast cancer recurrence.
Classification algorithms such as k-nearest neighbors (KNN), Naïve Bayes (NB), and fast decision tree (REPTree) need to be applied to classify whether patients have breast cancer recurrence. These classifiers have been used for many healthcare data prediction tasks [7][9][13][15][20][21]. However, these three popular classifiers have not been combined with PCA as feature extraction in predicting breast cancer recurrence. Another issue in machine learning studies concerns the performance metrics used. Most studies evaluate model performance with accuracy alone, although many other metrics can be used to measure the performance of machine learning classifiers, such as incorrectly classified instances, Cohen's kappa, recall, precision, and F-measure.
To tackle the aforementioned drawbacks, this study applied PCA to reduce the high dimensionality of the Wisconsin Prognostic Breast Cancer dataset. KNN, NB, and REPTree were used as classifiers in the prediction models. Performance metrics such as incorrectly classified instances, Cohen's kappa, recall, precision, and F-measure were applied in addition to accuracy during the comparative analysis to evaluate the difference in performance between PCA models and non-PCA models.
The remainder of this manuscript is organized as follows. Section 2 describes the PCA feature extraction and the NB, KNN, and REPTree classifiers used in this study. Section 3 explains every phase of the research methodology. Section 4 examines the findings of this research. Finally, Section 5 sets out the conclusions and outlines future work.

Feature Extraction Using PCA
Feature extraction is the procedure that transforms data from a high-dimensional space into a lower-dimensional space so that irrelevant, less relevant, or redundant dimensions are identified and disregarded within a given dataset [22][23]. These methods are usually applied as a preprocessing step for machine learning algorithms (MLA) in pattern recognition and prediction [24]. PCA is one such feature extraction approach.
Using PCA makes it possible to reduce the number of variables in a multivariate dataset while preserving as much of the variation in the dataset as possible. This reduction is accomplished by taking the $p$ original variables $T_1, T_2, T_3, \ldots, T_p$ and finding combinations of them that generate uncorrelated principal components (PCs) $PC_1, PC_2, PC_3, \ldots, PC_p$. These PCs are also known as eigenvectors. Notably, the lack of correlation is a valuable property because it indicates that the PCs measure different "dimensions" of the data. The PCs are ordered so that $PC_1$ shows the greatest variation, $PC_2$ shows the second greatest variation, and each subsequent PC shows no more variation than the one before it. That is, $\mathrm{var}(PC_1) \geq \mathrm{var}(PC_2) \geq \mathrm{var}(PC_3) \geq \cdots \geq \mathrm{var}(PC_p)$, where $\mathrm{var}(PC_i)$ represents the variance of $PC_i$ within the dataset. $\mathrm{var}(PC_i)$ is also known as the eigenvalue of $PC_i$.
The PCA algorithm starts by calculating the mean of each feature. The mean is then subtracted from the original data to obtain centered data. Afterward, the covariance matrix of the centered data points is calculated and decomposed into its eigenvectors and corresponding eigenvalues. Next, the eigenvectors are sorted in decreasing order of their eigenvalues. Choosing the first k (number of components) eigenvectors yields the new k dimensions. Finally, PCA transforms the original data points into the new reduced dimensions.
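The steps above can be written compactly as follows. This is a sketch under conventions the text does not state: $n$ records $x_i \in \mathbb{R}^p$, the sample covariance with the $1/(n-1)$ factor, and the symbols $S$, $v_j$, $\lambda_j$, $V_k$, and $z_i$ are all notation introduced here for illustration.

```latex
\[
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{(mean of each feature)} \\
\tilde{x}_i &= x_i - \bar{x}
  && \text{(centered data)} \\
S &= \frac{1}{n-1}\sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i^{\top}
  && \text{(covariance matrix)} \\
S v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0
  && \text{(eigenpairs, sorted)} \\
z_i &= V_k^{\top} \tilde{x}_i, \quad V_k = [\, v_1 \; v_2 \; \cdots \; v_k \,]
  && \text{(projection onto the first } k \text{ components)}
\end{aligned}
\]
```

In this notation, $\lambda_j$ plays the role of $\mathrm{var}(PC_j)$ and $v_j$ is the eigenvector that defines $PC_j$.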
Several studies have utilized PCA as the feature extraction method on healthcare data, especially the Wisconsin Breast Cancer dataset. For example, in [8], PCA was combined with a differential evolution support vector machine to improve the cancer detection ability with 97.64% accuracy. Hasan and Tahir [25] applied PCA as feature extraction and the artificial neural network as a classifier to enhance benign or malignant classification. Their method was found to discriminate between normal and breast cancer patients with 95.68% testing accuracy. Jamal et al. [19] implemented PCA with Support Vector Machine and Extreme Gradient Boosting in predicting breast cancer. Jhajharia et al. [26] conducted a study where PCA is applied together with artificial neural networks with 98.39% accuracy. Uzer et al. [27] conducted another successful study on breast cancer prediction. First, they selected important features by using sequential forward selection (SFSP) and sequential backward selection (SBSP) algorithms. The selected features from both algorithms were then fed to the PCA to reduce the dimensionality.
The new feature set was then used as an input for the neural network classifier. Their study achieved 98.57% and 97.57% accuracy for SFSP and SBSP, respectively. A recent study conducted by Bian et al. [28] proposed a new breast cancer prediction approach. They employed random forest as a feature selection to select a set of important features. These features were passed to PCA to reduce data dimensionality. The new feature set of seven principal components was finally fed to the extreme learning machine classification model with different activation functions. Their proposed model achieved 98.75% accuracy. Another study conducted by Roopa and Asha [29] achieved 96.07% accuracy using PCA with wrapper and linear regression algorithms in tuberculosis diagnosis. All of these studies show that applying PCA reduces the dimension of the dataset and increases the performance of the classifiers. However, none of these studies tested PCA with the three famous classifiers, namely, NB, REPTree, and KNN, in breast cancer recurrence detection.

Classification Algorithms
This subsection describes the three well-known classification algorithms used in healthcare: NB, REPTree, and KNN.

NB
NB is a classification method based on Bayes' theorem with an assumption of independence between the features. The classifier assumes that the presence of a particular feature within a class is unrelated to the presence of any other feature. The model is easy to develop and copes well with very large datasets. NB is known for its simplicity and strong classification performance [30]. It also performs well in multiclass prediction and is an easy and fast predictor of the class of a test set. If the independence assumption holds, an NB classifier performs better than logistic regression and needs less training data. Besides, it performs well with categorical input compared with numerical input, for which a normal (bell curve) distribution is assumed. Several studies have been carried out using the NB algorithm on healthcare data [7][13][15], and their findings confirm that it is a good classifier for predicting healthcare-related cases.
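The independence assumption can be made concrete with the standard form of the NB decision rule. The notation is introduced here for illustration and is not taken from the study: $c$ is a class (recurrence or nonrecurrence) and $x_1, \ldots, x_p$ are the feature values of one patient record.

```latex
\[
P(c \mid x_1, \ldots, x_p)
  = \frac{P(c)\,\prod_{i=1}^{p} P(x_i \mid c)}{P(x_1, \ldots, x_p)},
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{p} P(x_i \mid c)
\]
```

The product over features is where independence enters: each $P(x_i \mid c)$ is estimated separately, which is why the model is cheap to train.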

REPTree
The REPTree classifier is a fast decision tree learner built on the concept of calculating information gain with entropy and reducing the error arising from variance [31]. It was proposed in 2011 [32]. It uses regression tree logic and produces several trees in different iterations, from which the best tree is then selected. The algorithm uses information gain or variance to build the decision or regression tree and prunes the tree using reduced error pruning with back-fitting. It sorts numerical attribute values once, at the start of model preparation. The algorithm also handles missing values, as the C4.5 algorithm does, by dividing the corresponding instances into pieces [33]. A previous study [21] reported that REPTree performs well in classifying healthcare data.
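The entropy-based information gain mentioned above can be illustrated with a small sketch. It is not Weka's REPTree implementation; the function names and the toy labels ("R" for recurrence, "N" for nonrecurrence) are hypothetical and only show how a candidate split is scored.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy node with two recurrence and two nonrecurrence cases.
labels = ["R", "R", "N", "N"]
gain_perfect = information_gain(labels, ["R", "R"], ["N", "N"])  # split separates the classes
gain_useless = information_gain(labels, ["R", "N"], ["R", "N"])  # split changes nothing
```

A tree learner of this kind would prefer the first split, because it removes all uncertainty about the class, over the second, which removes none.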

KNN
KNN is a supervised classifier that selects the k nearest neighbors of a particular point by minimizing a distance measure, such as the Mahalanobis distance or the Euclidean distance [20]. To determine the class of an unlabeled example, KNN calculates its closeness to the existing (labeled) instances and identifies its k nearest neighbors and their respective labels. The unlabeled object is then assigned either to the dominant category in the neighborhood by majority vote or by a weighted majority vote in which points nearer to the unlabeled object are given greater weight. KNN is considered a good classifier in healthcare data recognition and prediction [9][20].
Nevertheless, to the best of our knowledge, the aforementioned three prominent classifiers have not been combined with PCA as feature extraction in predicting breast cancer recurrence.
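The majority-vote form of KNN described above can be sketched in a few lines. This is an illustration only, assuming Euclidean distance and unweighted voting; the function name, the toy points, and the labels are hypothetical and unrelated to the WPBC data.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already normalized two-feature records.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]
labels = ["nonrecurrence", "nonrecurrence", "nonrecurrence", "recurrence", "recurrence"]
prediction = knn_predict(points, labels, query=(0.85, 0.85), k=3)
```

Because the vote depends directly on distances, features on larger scales would dominate without the normalization step used in the preprocessing phase.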

Method
As shown in Fig. 1, the research method is broken down into five phases: data acquisition, preprocessing of data, model construction without PCA, model construction with PCA, and model comparison.

Phase One: Data Acquisition
In this phase, the study's data was acquired from the UCI public repository [34]. The Wisconsin Prognostic Breast Cancer (WPBC) dataset consists of 198 records with 34 predictor (independent) features and an output (dependent) feature. Each record represents follow-up data for one breast cancer case among Dr. Wolberg's patients since 1984. There are 151 nonrecurrence cases and 47 recurrence cases. The lymph node status feature has missing values in four cases.

Phase Two: Data Preprocessing
The number of processes in the data preprocessing phase depends on whether the model is developed with or without feature extraction. For a model without feature extraction, the data preprocessing phase comprises two processes: data cleaning and normalization. The model with feature extraction adds a third process, feature extraction.
During data cleaning, the missing values in the lymph node status feature were replaced with the most probable value by using the ReplaceMissingValues filter in Weka. The ID feature was deleted from the dataset because it does not affect the outcome, which reduced the number of predictor features to 33. Data normalization is useful when the features of a dataset have varying scales. The data acquired in Phase 1 was rescaled to the range between 0 and 1 by using the Normalize filter in Weka so that every feature has a similar value range.
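The rescaling to the range between 0 and 1 is commonly done with min-max normalization. The sketch below illustrates that idea and is not the Weka filter itself; the function name, the toy values, and the choice to map a constant column to 0.0 are assumptions made here.

```python
def min_max_normalize(rows):
    """Rescale every column of a list-of-lists table to the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            new_row.append(0.0 if hi == lo else (value - lo) / (hi - lo))
        normalized.append(new_row)
    return normalized


# Hypothetical toy data: two features on very different scales.
toy = [[10.0, 200.0], [20.0, 400.0], [15.0, 1000.0]]
scaled = min_max_normalize(toy)
```

After this step, both toy features span the same range, so neither dominates a distance-based classifier purely because of its units.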
In the feature extraction process, PCA was applied to reduce the feature dimension. The reasons for opting for PCA are as follows: (1) PCA aims to capture as much information as possible with high explained variance, unlike algorithms that select only a few important features and thereby lose information; (2) PCA works best on datasets with three or more dimensions, such as the WPBC dataset, whose 33 attributes make the results increasingly difficult to interpret; and (3) PCA is ideal for datasets of numeric variables such as WPBC. When applying PCA, it is best to choose a small number of principal components that cover as much variance as possible. In Weka, this only requires setting the variance covered to 0.95. The PCA algorithm then automatically selected 13 principal components to represent the 33 features while minimizing information loss. In other words, PCA reduced the number of predictors from 33 to 13 while retaining about 95% of the explained variance. The scree plot in Fig. 2 shows the number of principal components selected with the proportion of variance. The red line indicates the variance covered per component, and the green line indicates the cumulative variance covered by the components.
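One common way to turn a variance-covered threshold into a number of components is to accumulate the sorted eigenvalues until the threshold is reached. The sketch below shows that rule only; it is not claimed to be Weka's internal procedure, and the function name and eigenvalues are hypothetical rather than taken from the WPBC analysis.

```python
def components_for_variance(eigenvalues, variance_covered=0.95):
    """Return the smallest k whose top-k eigenvalues reach the requested share of variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_covered:
            return k
    return len(ordered)


# Hypothetical eigenvalues for a five-feature dataset.
toy_eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
k = components_for_variance(toy_eigenvalues)
```

With these toy values, the first three components cover 90% of the variance and the fourth brings the total to 96%, so four components are kept. The cumulative curve in a scree plot visualizes the same running total.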
PCA also provides the principal component loadings (Fig. 3). The first principal component, PC1, corresponds to a measure of 0.28813 × Mean_Concave_points + 0.277034 × Mean_Concavity + … + 0.0147931 × Lymph_node_status. Similarly, the second component, PC2, corresponds to a measure of 0.301744 × Mean_Fractal_dimension + 0.288685 × Worst_Fractal_dimension + … − 0.00457267 × Worst_Texture. PCA computes the eigenvectors, which are the principal components, and their respective eigenvalues, which capture the magnitude of the variance. The eigenpairs were then arranged in decreasing order of their eigenvalues, and the pair with the largest eigenvalue was picked first. This is the first principal component, which preserves the most information from the original data. The new data frame was created from the 13 principal components and their eigenvalues. Table 1 presents a sample of the first ten rows of the data frame that is used as input in Phase 4.

Phase Three: Model Construction Without PCA
This phase entailed constructing classification models with three common classifiers, namely, KNN, NB, and REPTree, using the tenfold cross-validation test option in Weka. The dataset was divided into ten pieces (folds); each fold was held out in turn for testing while the remaining nine folds were used together for training. The average of the ten evaluation results was then calculated. After that, Weka invoked the classifier one last (11th) time on the entire dataset to print out the final evaluation result.
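The fold rotation described above can be sketched as an index generator. This is a simplified illustration: it neither shuffles nor stratifies the records, which an actual tool may do, and the function name is hypothetical. Only the record count of 198 comes from the text.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 198 records, as in the WPBC dataset, split into ten folds.
folds = list(k_fold_indices(198, k=10))
fold_sizes = [len(test) for _, test in folds]
```

Every record is used for testing exactly once and for training nine times, so the averaged result reflects performance on data the model did not see during training.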

Phase Four: Model Construction With PCA
This phase entailed carrying out feature extraction by utilizing the PCA obtained from Phase 2 to minimize the dimensionality of the dataset. After that, Phase 3 was repeated to build three classification models with the reduced feature set.

Phase Five: Model Comparison
The performance of the prediction models built with and without PCA was compared in this phase. Most previous research only used one or two performance criteria, leading to bias in the result discussion. Table 2 lists the performance criteria with their descriptions.

Results and Discussion
This section examines the models constructed in Phases 3 and 4 and the results of the comparative analysis in Phase 5. Table 3 lists the summary statistics for the three classification models without feature extraction. The summary shows that REPTree outperforms the other two classifiers in accuracy by correctly classifying 149 (75.25%) instances. However, its negative Cohen's kappa value (−0.0198) indicates that REPTree is not an effective classifier for predicting whether a patient has breast cancer recurrence. The Cohen's kappa value of 0.1794 indicates slight agreement that NB is an effective breast cancer recurrence classifier, and the value of 0.2271 indicates fair agreement for KNN. The prognostic breast cancer prediction task is an imbalanced classification problem in which two classes need to be predicted, recurrence and nonrecurrence, with nonrecurrence making up the large majority of the data points. The confusion matrix displayed in Table 4 breaks down the data in Table 3 and presents the actual and predicted labels from the classification results of the three models. True positives (TPs) are correctly identified recurrence cases, and true negatives (TNs) are correctly identified nonrecurrence cases. Conversely, false positives (FPs) are patients falsely identified as recurrence cases, and false negatives (FNs) are patients falsely identified as nonrecurrence cases. Table 4 corroborates that the REPTree model is an ineffective classifier because it correctly identifies nonrecurrence cases (TN = 149) but none of the recurrence cases (TP = 0).
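The relation between the confusion matrix and Cohen's kappa can be checked with a short sketch using the standard two-class kappa formula. TP = 0 and TN = 149 are the REPTree values stated above; FN = 47 and FP = 2 are not quoted in the text and are derived here from the 47 recurrence and 151 nonrecurrence cases. The function name is hypothetical.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a two-class confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement: product of actual and predicted marginals, summed over both classes.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    return (observed - expected) / (1 - expected)


# REPTree without PCA: TP and TN from the text, FN and FP derived from the class totals.
kappa = cohen_kappa(tp=0, fn=47, fp=2, tn=149)
accuracy = (0 + 149) / 198
print(round(kappa, 4), round(accuracy, 4))  # -0.0198 0.7525
```

The example shows why accuracy alone is misleading on imbalanced data: labeling almost every case as nonrecurrence yields 75.25% accuracy, yet the agreement is no better than chance.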

The detailed accuracy by class was then examined to inspect the performance of the three classifiers further. Table 5 presents the detailed accuracy by class for the three models without feature extraction. In this analysis, creating a balanced classification model with the optimal balance of recall and precision remains the top priority. The weighted average of recall, precision, and F-measure over the two classes was calculated using (1):

$$\bar{X} = \frac{X_1 c_1 + X_2 c_2}{c_1 + c_2} \qquad (1)$$

where $X_1$ and $X_2$ are the values for class 1 and class 2 of the metric $X$, which can be precision (P), recall (R), or F-measure (F), $c_1$ is the number of instances in class 1, and $c_2$ is the number of instances in class 2. The weighted average F-measure for NB, for example, is calculated in this way.

Evaluation of the Models Constructed With PCA
Upon completing the feature extraction process, PCA transformed 33 correlated features into a novel set of 13 linearly uncorrelated principal components that captured over 95% of the training dataset's initial variance.
The results shown in Table 6 imply that NB outperforms the other two classification models because it produced the highest proportion of accurately classified instances (77.78%) and the lowest proportion of inaccurately classified instances (22.22%). Although its Cohen's kappa value is the highest at 0.3047, this represents only fair agreement that the model can separate the instances into the right class. PCA improves the performance of REPTree by increasing the accurately classified instances by 1.52% and reducing the inaccurately classified instances by 1.52%. A kappa statistic of 0.1927 shows the improvement of REPTree from an ineffective classifier to slight agreement that it can be a promising breast cancer recurrence classifier. The confusion matrix analysis for the classification models with feature extraction (Table 7) verifies the results listed in Table 6. For each model, the number of accurately classified instances equals the sum of TP and TN, and the number of inaccurately classified instances equals the sum of FN and FP. For instance, the number of instances accurately classified by NB is TP + TN = 17 + 137 = 154.
Table 8 presents the detailed accuracy by class for the classification models with feature extraction. The F-measure value (0.761) corroborates that NB is the best classification model to apply with PCA in predicting whether a patient has a breast cancer recurrence.
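The weighted F-measure reported for NB with PCA can be reproduced from the confusion matrix counts as a consistency check. TP = 17 and TN = 137 are the values stated above; FN = 30 and FP = 14 are derived here from the 47 recurrence and 151 nonrecurrence cases, and the function names are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_average(x1, c1, x2, c2):
    """Class-size-weighted average of a per-class metric, as in (1)."""
    return (x1 * c1 + x2 * c2) / (c1 + c2)


# NB with PCA: TP and TN from the text, FN and FP derived from the class totals.
tp, tn = 17, 137
fn, fp = 47 - tp, 151 - tn

f_recurrence = f_measure(tp=tp, fp=fp, fn=fn)
# For the nonrecurrence class the roles swap: TN are its hits,
# FN are its false alarms, and FP are its misses.
f_nonrecurrence = f_measure(tp=tn, fp=fn, fn=fp)
weighted_f = weighted_average(f_recurrence, 47, f_nonrecurrence, 151)
print(round(weighted_f, 3))  # 0.761
```

The per-class values make the imbalance visible: the F-measure for the recurrence class alone is about 0.44, while the weighted figure of 0.761 is pulled up by the much larger nonrecurrence class.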

Comparative Analysis
The comparative analysis compared the models produced with and without PCA to determine the impact of reducing feature dimensionality through principal component analysis on the results. Fig. 4, Fig. 5, and Fig. 6 depict the performance of the three classification models built with and without PCA. The results show that when PCA is used for feature extraction, the performances of NB and REPTree improve: the number of accurately classified instances increases, the number of inaccurately classified instances decreases, and the value of Cohen's kappa increases. However, this trend is not observed for KNN.
Fig. 7 presents the weighted average of each performance measure (precision, recall, and F-measure) for every classification model with and without PCA. The results confirm that the performances of NB and REPTree improve with PCA as feature extraction, whereas this trend is not observed for KNN. They also reveal that NB built with PCA is superior to the other five classification models. The much higher recall of NB built with PCA (77.8%) denotes its strong potential for predicting recurrence cases, which is especially important for actual breast cancer patients.
Fig. 7. Precision, recall, and F-measure for three models with and without PCA

The Difference from Prior Work
This study differs from prior works in several ways. First, it followed a systematic technique that combined PCA with three popular classifiers, namely, KNN, NB, and REPTree, to predict breast cancer recurrence, which is an entirely new experiment. Second, the comparative analysis of the three classifiers was performed not only with PCA but also without PCA as a control for the experiment. Finally, the findings were examined thoroughly using a range of performance metrics, namely accurately classified instances, inaccurately classified instances, Cohen's kappa, the confusion matrix, precision, recall, and F-measure, to avert bias.

Conclusion
This investigation aimed to compare and improve the performance of three established data mining algorithms, namely, NB, KNN, and REPTree, using PCA for feature extraction in predicting breast cancer recurrence. The comparison was conducted between models built with and without PCA. PCA, an unsupervised learning method, was employed to remove redundant data and extract new principal components to substitute for the initial features. A variance-covered threshold of 95% was used to reduce the feature dimension from 33 to 13 while retaining principal components that represent roughly 95% of the variance in the initial dataset. These preprocessing stages provided a valuable, reduced feature set on which the MLA can train a classifier. The comparative analysis revealed that PCA improved the breast cancer recurrence detection ability of NB and REPTree, but not of KNN, for the WPBC dataset. Overall, this study strengthens the idea that without feature extraction, the performance of NB and REPTree falls short in detecting breast cancer recurrence. In contrast, applying PCA to refine and reduce the number of features increases the breast cancer recurrence detection ability of NB by approximately 10% and of REPTree by 2%, which is crucial to real patients with breast cancer. The results show the significance of reducing feature dimensionality, particularly for classifiers whose performance can be strongly affected by a large number of features. In conclusion, this study shows that two of the three classifiers, NB and REPTree, performed better when PCA was applied as feature extraction, with F-measure values of 76.1% and 72.8%, respectively. Thus, PCA can be considered by researchers and practitioners to improve breast cancer recurrence prediction on the WPBC dataset.
Further research should explore other feature extraction techniques for reducing the dimensionality of the prognostic breast cancer dataset to improve the performance of classification models in predicting recurrence. Machine learning techniques for handling the imbalanced data issue in the prognostic breast cancer dataset should also be studied.