A hybrid ensemble deep learning approach for reliable breast cancer detection

Article history: Received January 18, 2021; Revised March 29, 2021; Accepted April 2, 2021; Available online April 20, 2021

Among cancer diseases, breast cancer is considered one of the most prevalent threats, requiring early detection for a higher recovery rate. Meanwhile, the manual evaluation of malignant tissue regions in histopathology images is a critical and challenging task. Deep learning has become a leading technology for automatic tumor feature extraction and classification as malignant or benign. This paper presents a hybrid deep learning-based approach for reliable breast cancer detection in three consecutive stages: 1) fine-tuning the pre-trained Xception-based classification model, 2) merging the extracted features with the predictions of a two-layer stacked LSTM-based regression model, and finally, 3) applying the support vector machine, in the classification phase, to the merged features. For the three stages of the proposed approach, training and testing are performed on the BreakHis dataset with nine different augmentation techniques to ensure generalization of the proposed approach. A comprehensive performance evaluation of the proposed approach, with diverse metrics, shows that employing the LSTM-based regression model improves the accuracy and precision of the fine-tuned Xception-based model by 10.65% and 11.6%, respectively. Additionally, implementing the support vector machine as a classifier further boosts the model by 3.43% and 5.22% for the two metrics, respectively. Experimental results demonstrate the proposed approach's efficiency and outstanding reliability in comparison with recent state-of-the-art approaches.

Memory (LSTM) model as a regressive one. The former can exploit spatial correlation in image data, while the latter can make predictions over sequences in these data. For BC research, a publicly available benchmark dataset [7], a histopathological database of microscopic breast tumor images (BreakHis), has been introduced. It is widely employed in evaluating state-of-the-art BC detection approaches [8]-[12].
BC image recognition schemes can typically be classified into two main categories based on the feature extraction method: hand-crafted extraction and automatic extraction [13], [14]. Different research works have been published concerning cancer detection using machine learning techniques [15], [16]. However, the applicability of such methods is limited by manual feature extraction, which is a critical step of BC detection. Traditionally, hand-crafted feature descriptors such as SIFT [17] and SURF [8] were used for feature extraction until the advent of DL techniques, which can extract more discriminative information from data with no need for human experts to design feature extractors.
In the second category of feature extraction, DL techniques offer an automated, accurate, and reliable methodology for learning features from medical images in a way that avoids the constraints of hand-crafted features [18]. CNNs, as a type of deep feed-forward network, have achieved empirical success in the automatic diagnosis and analysis of BC in histopathological images [3]-[6], classifying images into one of two classes: benign (non-cancerous) or malignant (cancerous). Training DL models from scratch on large datasets [19] is a tedious task due to computational complexity and convergence problems [20]. Furthermore, when high-quality labeled samples are insufficient, as in most common medical BC datasets [7], one can benefit from applying Transfer Learning (TL) [21], [22] to one of the top-ranked pre-trained models, which converges faster and outperforms training from scratch [20]-[23].
In [8], the authors showed that, using fine-tuning, the pre-trained VGG16, VGG19, and ResNet50 models achieved improved accuracy but with only 92.0% precision, a reliability indicator. In [9], a two-step TL-based approach is proposed for feature extraction from histopathological images using Inception-v3 and an SVM classifier, which improved the classification accuracy by 3.7%. The use of multiple-instance learning for histopathological BC detection is investigated in [10]; however, the presented average accuracy is only 88%. In [11], a TL-based model is proposed and trained on a stain-normalized and augmented BreakHis dataset; the observed accuracy and precision are 81.25% and 91.79%, respectively. In [12], TL is applied to the pre-trained Xception model; however, an important evaluation metric such as precision, needed to study the approach's reliability, is not presented. The combination of pre-trained CNN activation features with SVMs has been investigated in [24], while another combination of CNN and LSTM in [25] achieves an average precision over the four categories of the BreakHis dataset of only 90.25%. In [26], a compact CNN approach achieves accuracy and precision of 87.40% and 88.08%, respectively. In [27], a multi-layer feature fusion for BC image classification is proposed, in which the independence and partial dependence of all sublayers are considered. In [28], a deep convolutional generative adversarial network is proposed to balance the class distribution of the BC dataset by augmenting only the minority classes, avoiding classifier bias toward the majority class. In [29], an ensemble deep learning approach achieved an accuracy of 95.3% with a lower precision of 93.5%.
In this paper, a hybrid approach that combines a TL-based classification model with a regression model, for more tuned and robust feature extraction, is combined with an SVM classifier for highly accurate and reliable BC detection. This work investigates breast cancer detection using a combined Xception-based classification approach and an LSTM-based regression one, producing highly tuned extracted features that feed a robust Support Vector Machine (SVM) classifier. Combining the classification and regression approaches makes the proposed approach highly reliable in terms of accuracy, precision, and the different false rates. Section 2 presents the overall methodology of the proposed approach as a sequence of stages, while Section 3 discusses the experimental work with a detailed analysis of the results against those of recent competing state-of-the-art approaches. Then, Section 4 concludes the work by highlighting the main results, followed by possible future work.

Stage of implementing a convolutional-based classification
In the proposed approach, the pre-trained Xception-based model is applied in a fine-tuned manner. The Xception model is a deep CNN in the form of a linear stack of depthwise separable convolution layers, in a modified version, with residual connections. A modified depthwise separable convolution consists of a 1*1 pointwise convolution that maps cross-channel correlations, followed by an n*n depthwise convolution that separately maps every channel's spatial correlations. Depthwise separable convolution provides a greatly reduced parameter count and lower computational complexity while maintaining cross-channel features. For an n*n convolutional layer with k input channels and m output channels, regular convolution generates (k*n*n*m) parameters, whereas depthwise separable convolution generates (pointwise Conv. + depthwise Conv.) = (k*1*1*m + n*n*m) parameters, as illustrated in Fig. 1. The Xception architecture has outperformed VGG16, ResNet, and Inception-V3 in most classical classification challenges [30]-[32]. The Xception model comprises 36 convolutional layers forming the feature extraction base of the network, structured into 14 modules, all of which have linear residual connections around them except for the first and last modules. It was previously trained on a 1000-class single-label classification task on the ImageNet dataset [19] of more than 14 million images.
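The two parameter counts quoted above can be compared numerically. The following is a minimal sketch that simply evaluates the two expressions from the text; the function names and the example values of k, m, and n are illustrative assumptions and are not taken from the paper, and bias terms are ignored.

```python
def regular_conv_params(k: int, m: int, n: int) -> int:
    """Weights of a regular n x n convolution from k input to m output channels."""
    return k * n * n * m


def modified_separable_conv_params(k: int, m: int, n: int) -> int:
    """Weights of a 1 x 1 pointwise convolution (k -> m channels) followed by
    an n x n depthwise convolution applied to each of the m channels."""
    pointwise = k * 1 * 1 * m
    depthwise = n * n * m
    return pointwise + depthwise


if __name__ == "__main__":
    k, m, n = 128, 256, 3  # illustrative sizes only (assumption)
    regular = regular_conv_params(k, m, n)
    separable = modified_separable_conv_params(k, m, n)
    print(f"regular: {regular}, separable: {separable}, "
          f"ratio: {regular / separable:.2f}")
```

For these example sizes, the separable form needs roughly one eighth of the weights of the regular convolution.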

Applying Transfer Learning
In this step, TL is applied to the Xception-based model for the BC detection task. Recent implementations of DL-based models, as in Fig. 2(a), adopt one of two main methods. The first method trains the model from scratch on a large dataset to achieve better accuracy. The second method incorporates TL, in which the parameters of a model pre-trained with high accuracy on a specific task are used to initialize the new model, with the necessary modifications for the required task. TL is mainly useful for tasks where not enough training samples are available to train a model from scratch, such as medical image classification [21]-[33], as in Fig. 2.

International Journal of Advances in Intelligent Informatics ISSN 2442-6571 Vol. 7, No. 2, July 2021, pp. 112-124

Generally, the low levels of DL models provide generic features, while the higher levels provide task-specific features. The learned features are related to the task of the pre-trained model. Therefore, two main factors in transfer learning determine how the pre-trained model can be used for a new task: 1) the size of the target dataset and 2) the similarity of the new task to that of the pre-trained model. These factors lead to four different cases, as shown in Fig. 3.
• Case 1 is for a small dataset and a similar task, in which the high-level features, i.e., those from the top layers, are specific to the same task and can be reused. Hence, the original model is applied as a feature extractor with no modification, and only the classifier on top of it is retrained.
• Case 2 is for a small dataset and a different task, for which the high-level features cannot be reused. Hence, the original pre-trained model can be applied as a feature extractor but should be retrained from a selected low level, whose features are more generic than those of the higher layers, to the end of the model, i.e., the fully connected layers. The part from the start of the pre-trained model to the selected low level is kept frozen.
• Case 3 is for a large dataset and a similar task, such that the pre-trained weights, from a low level of the model, should be fine-tuned. The pre-trained model should be relearned, starting from a lower level, i.e., the high-level convolutional layers, to obtain some newly learned features.
• Case 4 is for a large dataset and a different task, allowing the whole base model to be fine-tuned and relearned with that amount of data.
In the first stage of the proposed approach, the TL technique, as in Case 4, is applied to the pre-trained Xception model for the BC detection task. This step combines the pre-trained Xception-based model with three randomly initialized Fully Connected (FC) layers of dimensions DFC1, DFC2, and DFC3, respectively, an LR layer in the form of a two-node dense layer, and a binary cross-entropy loss function, as shown in Fig. 4.
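The four cases above amount to a small decision table over dataset size and task similarity. The sketch below encodes that table for illustration only; the names `TransferPlan` and `choose_transfer_plan` are hypothetical, and the strategy strings paraphrase the case descriptions rather than configure any real framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferPlan:
    case: int
    strategy: str


def choose_transfer_plan(large_dataset: bool, similar_task: bool) -> TransferPlan:
    """Map the two transfer-learning factors to one of the four cases."""
    if not large_dataset and similar_task:
        return TransferPlan(1, "keep the base frozen; retrain only the classifier on top")
    if not large_dataset and not similar_task:
        return TransferPlan(2, "freeze up to a selected low level; retrain from there to the end")
    if large_dataset and similar_task:
        return TransferPlan(3, "fine-tune the pre-trained weights of the high-level convolutional layers")
    return TransferPlan(4, "fine-tune and relearn the whole base model")


if __name__ == "__main__":
    # The proposed approach treats its first stage as Case 4.
    print(choose_transfer_plan(large_dataset=True, similar_task=False))
```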

Stage of merging Xception-based classification with LSTM-based regression
In the second stage of the proposed approach, a Recurrent Neural Network (RNN) model is suggested as the regressive branch. It provides predictions for sequences of data in images, which can be applied as multiplicative values to the features extracted from the fine-tuned Xception-based model, yielding more enhanced extracted features. In the regressive branch, the LSTM network, a special type of RNN, is implemented to learn long-term dependencies and overcome the vanishing and exploding gradient problems of typical RNNs [34].
The LSTM model consists of multiple looped networks. Each network in the loop takes input information from the preceding network and produces an output, besides passing the information to the next network. The repeating module of the LSTM is the memory cell, which consists of various gates: a forget gate for controlling the amount of previous information to pass, an input gate for selecting the values to be updated, and an output gate for deciding the information carried by the hidden state [34], [35]. A stacked LSTM architecture is an LSTM model with multiple LSTM layers, which makes the model deeper to enhance the prediction efficiency. For the stacked LSTM-based approach, two layers are recommended to avoid the degradation problem, in which the model becomes more difficult to train and the prediction accuracy therefore saturates [35], [36].

The model feature vector Fmodel is then applied to the three FC layers and the LR layer mentioned in the first stage. Training of the second stage of the proposed approach proceeds in two sequential steps. In the first training step, the fine-tuned Xception-based branch from the first stage is frozen, and the LSTM-based branch combined with the three randomly initialized FC layers is trained from scratch. In the second training step, the whole system, including the two branches of the Xception model and the LSTM model, is fine-tuned.
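To make the gate description concrete, the following is a minimal pure-Python sketch of one standard LSTM cell step, a two-layer stack, and an element-wise multiplicative merge. It is not the authors' implementation: the weights are random, the sizes are toy values, and the element-wise product in `merge_multiplicative` is only an assumed reading of "multiplicative values", since the exact merge rule is not given in the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_layer(input_size, hidden_size, rng):
    """Small random weights for the forget (f), input (i), and output (o) gates
    and the candidate values (g); each sees the concatenation [x_t, h_prev]."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return {name: (matrix(), [0.0] * hidden_size) for name in "fiog"}


def lstm_step(params, x_t, h_prev, c_prev):
    """One step of a standard LSTM memory cell."""
    z = x_t + h_prev  # list concatenation of input and previous hidden state
    f = [sigmoid(v) for v in affine(*params["f"], z)]    # how much old state to keep
    i = [sigmoid(v) for v in affine(*params["i"], z)]    # which values to update
    o = [sigmoid(v) for v in affine(*params["o"], z)]    # what the hidden state exposes
    g = [math.tanh(v) for v in affine(*params["g"], z)]  # candidate cell values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def stacked_lstm(sequence, layers, hidden_size):
    """Feed a sequence through stacked layers; return the top layer's last hidden state."""
    states = [([0.0] * hidden_size, [0.0] * hidden_size) for _ in layers]
    for x_t in sequence:
        layer_input = x_t
        for idx, params in enumerate(layers):
            h, c = lstm_step(params, layer_input, *states[idx])
            states[idx] = (h, c)
            layer_input = h  # the output of one layer feeds the next
    return states[-1][0]


def merge_multiplicative(cnn_features, lstm_predictions):
    """Element-wise product of two equally sized vectors (assumed merge rule)."""
    if len(cnn_features) != len(lstm_predictions):
        raise ValueError("both branches must output vectors of the same size")
    return [a * b for a, b in zip(cnn_features, lstm_predictions)]


if __name__ == "__main__":
    rng = random.Random(0)
    input_size, hidden_size, steps = 5, 4, 6  # toy sizes (assumption)
    layers = [init_layer(input_size, hidden_size, rng),
              init_layer(hidden_size, hidden_size, rng)]
    sequence = [[rng.random() for _ in range(input_size)] for _ in range(steps)]
    print(stacked_lstm(sequence, layers, hidden_size))
```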

Stage of implementing SVM classifier for the merged Xception and LSTM features
In the third stage, i.e., the final one of the proposed approach, the three FC layers and the LR layer of the second stage are replaced by an SVM classifier. SVM has become a powerful machine learning tool for binary as well as multi-class labeling scenarios [37]. It is extensively used in computer vision applications, especially medical ones [38]-[40].
Given a sample test image as a feature vector $F_{model}$ of dimension $N$ that is closest to the hyperplane $H$, it forms an orthogonal vector $d$ that stems from it in the same direction as $w$. Any point $X_0 \in H$ (with corresponding $y_0 = 0$) forms a vector $r$ with $F_{model}$, in which $d$ is the projection of $r$ on $w$, given by $d = 1/\|w\|$. Therefore, one can find the optimal margin by maximizing the dual Lagrangian

$L = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{m} \alpha_i \alpha_m y_i y_m K(F^{i}_{model}, F^{m}_{model})$ (2)

where $\alpha_i$ are the Lagrange multipliers, $y_i$ are the class labels, and $K$ is a kernel function that is used to express the product of the $F^{m}_{model}$ and $F^{i}_{model}$ inputs. Kernel functions allow the transformation from non-linearly separable spaces to linearly separable ones and are considered a useful tool for solving diverse classification tasks [41]. For each feature vector $F_{model}$, a slack variable $\varepsilon$ is defined, which is zero for points on the margin and increases with the distance from the correct boundary, up to points on the wrong side, for which $\varepsilon$ is expected to be greater than 1. By substituting $\varepsilon$ into $L$, the quantity to minimize over the $n$ training samples is $\|w\|^2/2 + C \sum_{i=1}^{n} \varepsilon_i$.
In the final stage of the proposed approach, the three FC layers and the LR layer, as mentioned in the second stage shown in Fig. 5, are replaced by the SVM classifier as shown in Fig. 6.
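The role of the slack variables in the SVM stage can be illustrated with a short numeric sketch. It evaluates the standard soft-margin objective for a linear decision function, with each slack taken as the hinge value max(0, 1 - y(w·x + b)); the weight vector, the toy feature vectors, and the function name are illustrative assumptions, not values from the paper.

```python
def soft_margin_objective(w, b, C, features, labels):
    """Return (objective, slacks) for ||w||^2 / 2 + C * sum(slacks).

    Each slack is max(0, 1 - y * (w . x + b)): zero on or beyond the margin,
    between 0 and 1 inside the margin, and above 1 on the wrong side.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = []
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * score))
    return margin_term + C * sum(slacks), slacks


if __name__ == "__main__":
    w, b, C = [1.0, 0.0], 0.0, 10.0  # toy hyperplane; C = 10 as in the experiments
    features = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]]
    labels = [1, 1, -1, -1]  # the last sample lies on the wrong side
    objective, slacks = soft_margin_objective(w, b, C, features, labels)
    print(slacks, objective)
```

In this toy case, the first and third samples have zero slack, the second lies inside the margin, and the fourth is misclassified with a slack above 1.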

Dataset Description
In the experiments, the BreakHis dataset [7] is used to validate the proposed approach during its three stages. It contains a total of 7909 sample images, each categorized as either benign (2480 samples) or malignant (5429 samples). The samples were collected from 82 patients at different magnification factors (40x, 100x, 200x, 400x) in RGB format with a resolution of 700*460*3.

Pre-processing and Data Augmentation
In the pre-processing phase, all histopathological images in the dataset are normalized to reduce color variation, which enhances color consistency. Training a DL model on a larger dataset is the best way to generalize it and to minimize the probability of overfitting in the obtained results. Besides that, to avoid the degradation that may affect state-of-the-art deep predictive models due to data scarcity, data augmentation is used to artificially expand the labeled training dataset [42], [43]. For these reasons, nine data augmentation techniques are applied to the dataset before training: 1) horizontal shift, 2) vertical shift, 3) horizontal-vertical shift, 4) horizontal flip, 5) vertical flip, 6) random rotation, 7) random brightness, 8) random zoom, and 9) Gaussian noise. This enlarges the dataset to 10 times its original size and in turn improves the model's generalization.
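A few of the listed augmentations can be sketched on a toy image represented as a list of pixel rows. This is only an illustration of three of the nine techniques (horizontal flip, vertical flip, horizontal shift); the zero padding used for shifted-in pixels and all function names are assumptions, as the text does not specify how the augmentations were implemented.

```python
def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def horizontal_shift(img, dx, fill=0):
    """Shift right by dx pixels (left if dx is negative), padding with `fill`."""
    width = len(img[0])
    shifted = []
    for row in img:
        if dx >= 0:
            shifted.append([fill] * min(dx, width) + row[:max(width - dx, 0)])
        else:
            shifted.append(row[-dx:] + [fill] * min(-dx, width))
    return shifted


def augment(img):
    """Return the original image plus three augmented copies."""
    return [img, horizontal_flip(img), vertical_flip(img), horizontal_shift(img, 1)]


if __name__ == "__main__":
    toy = [[1, 2, 3],
           [4, 5, 6]]
    for variant in augment(toy):
        print(variant)
```

With all nine techniques applied once per image, the original plus nine copies would give the 10-fold enlargement mentioned above.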

Performance Evaluation Metrics
In the experiments, a list of different metrics [24] that target the proposed approach's accuracy and reliability is considered. The targeted metrics are defined in terms of the numbers of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN):
• Accuracy, the higher the better, is the fraction of correctly classified samples (3).
$Accuracy = (TP + TN) / (TP + TN + FP + FN)$ (3)
• Precision, the higher the better, measures the model's performance in terms of positive example classification, i.e., the percentage of the Positively Predicted Values (PPV) that were truly positive. It indicates the model's reliability in cases where FP is a higher concern than FN (4).
$Precision = TP / (TP + FP)$ (4)
• The Negative Predicted Value (NPV), the higher the better, can be defined as (5).
$NPV = TN / (TN + FN)$ (5)
• The False Discovery Rate (FDR), the lower the better, can be defined as (6).
$FDR = FP / (FP + TP)$ (6)
• The False Positive Rate (FPR) and False Negative Rate (FNR), the lower the better for each, can be calculated as:
$FPR = FP / (FP + TN)$, $FNR = FN / (FN + TP)$
Besides the considered metrics, the Receiver Operating Characteristic (ROC) graph is extracted for the three stages of the proposed approach. The ROC curve shows the True Positive Rate (TPR) as a function of the False Positive Rate (FPR). Additionally, the Area Under the Curve (AUC) of the ROC curve, the higher the better, measures the classification model's capability to discriminate between different classes. While the ROC is a two-dimensional representation of the model's performance, the AUC provides this information in a single scalar. It is a commonly used evaluation method for binary classification tasks, as it provides a better assessment of the model's ability to discriminate between the two classes.
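The metrics in this section all follow from the four confusion-matrix counts. The sketch below computes them using the standard definitions; the function name and the example counts are made up for illustration and are not results from the paper.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),         # negative predictive value
        "fdr": fp / (fp + tp),         # equals 1 - precision
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in confusion_metrics(tp=90, tn=80, fp=10, fn=20).items():
        print(f"{name}: {value:.4f}")
```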

Model Setting and Training Hyper-parameters
In the experiments for all three stages of the proposed model, the three trainable FC layers have dimensions DFC1 = 2048, DFC2 = 512, and DFC3 = 128. For the LSTM-based branch, as in Fig. 5 and Fig. 6, each chunk is of size m = 3500, and the time series has t = 92 steps. The output of both the Xception branch and the LSTM-based branch is of size N = 2048, which also corresponds to the merge layer's dimension. In the third stage, the SVM module uses a linear kernel function and a regularization parameter C = 10.
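One way to read the LSTM input sizes is that t * m = 92 * 3500 = 322000, which equals the 700 * 460 pixels of a single image channel. The sketch below splits such a flattened channel into 92 chunks of 3500 values. This interpretation is an assumption inferred from the numbers; the text does not state how the chunks are formed, and the function name is hypothetical.

```python
WIDTH, HEIGHT = 700, 460        # BreakHis image resolution
T_STEPS, CHUNK_SIZE = 92, 3500  # t and m from the model settings


def to_sequence(pixels):
    """Split a flat list of WIDTH * HEIGHT values into T_STEPS chunks of CHUNK_SIZE."""
    if len(pixels) != T_STEPS * CHUNK_SIZE:
        raise ValueError(f"expected {T_STEPS * CHUNK_SIZE} values, got {len(pixels)}")
    return [pixels[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(T_STEPS)]


if __name__ == "__main__":
    assert WIDTH * HEIGHT == T_STEPS * CHUNK_SIZE  # 322000 on both sides
    flat_channel = [0.0] * (WIDTH * HEIGHT)  # placeholder for one flattened channel
    sequence = to_sequence(flat_channel)
    print(len(sequence), len(sequence[0]))
```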
The dataset, i.e., BreakHis, is partitioned into 70%, 10%, and 20% for the training, validation, and testing phases, respectively. The nine augmentation techniques are applied only to the 80% of the dataset allocated to the training and validation phases of the proposed approach. The remaining 20% of the dataset is left without augmentation for testing, so that the obtained results are highly reliable. K-fold cross-validation with K = 10 is applied in the training phase to obtain a less biased model, avoid overfitting, and improve the generalization of the predictive model [44]. In the training phase, the learning rate is 0.0001, with a dropout rate of 0.3 and the Adam optimizer. The batch size is 64, and the number of epochs is 100.
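The partitioning described above can be sketched with index lists. This is a minimal illustration, assuming a random shuffle with a fixed seed and assuming that the 10 folds are formed within the training portion; the text does not specify either detail, and the function names are hypothetical.

```python
import random


def split_indices(n, seed=0, fractions=(0.7, 0.1, 0.2)):
    """Shuffle indices 0..n-1 and split them into train, validation, and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, about 20%
    return train, val, test


def k_fold(indices, k=10):
    """Yield (fit, held_out) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]


if __name__ == "__main__":
    train, val, test = split_indices(1000)
    print(len(train), len(val), len(test))
    for fit, held_out in k_fold(train, k=10):
        print(len(fit), len(held_out))
```

Only the training and validation indices would then be passed to augmentation, leaving the test indices untouched.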

Comparative Results and Analysis
Classification results of the three consecutive stages of the proposed approach are presented to fully evaluate each stage's contribution. The proposed approach's performance is further analyzed through the associated ROC curves, as shown in Fig. 7. The average AUC values for the first, second, and third stages of the proposed approach are 0.84, 0.92, and 0.95, respectively. The resulting AUC values show an enhancement of 9.52% from implementing the LSTM-based model in the second stage, and an additional enhancement of about 3% from applying the SVM classifier in the final stage.
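The AUC enhancements quoted above are consistent with relative gains between consecutive stages. The short check below recomputes them from the three reported AUC values; reading the percentages as relative (rather than absolute) differences is an assumption, and the function name is illustrative.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    auc_s1, auc_s2, auc_s3 = 0.84, 0.92, 0.95  # reported average AUC per stage
    print(f"stage 2 over stage 1: {relative_gain(auc_s1, auc_s2):.2f}%")
    print(f"stage 3 over stage 2: {relative_gain(auc_s2, auc_s3):.2f}%")
```

Under this reading, the second value comes out slightly above 3%, close to the rounded figure in the text.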
The performance metrics obtained for the three stages of the proposed approach, S1, S2, and S3, are shown in Table 1, including accuracy, precision, FPR, FDR, and FNR, for the four subcategories of the BreakHis dataset: 40x, 100x, 200x, and 400x. The experimental results in Table 1 show significant enhancements by the second stage in all evaluation metrics for all subcategories of the dataset, and additional enhancements by the third stage in almost all cases. In the few cases where the second stage outperforms the third one, the differences are small in comparison with those between the second and the first stages.
Moreover, as shown in Fig. 8, the overall incremental enhancements of the second stage over the first one are 10.65%, 11.6%, 48.26%, 49.38%, and 45.04% for the metrics accuracy, precision, FDR, FPR, and FNR, respectively, while the corresponding enhancements of the third stage over the second stage are 3.43%, 5.22%, 46.88%, 48.78%, and 13.89%, respectively. These significant enhancements in the FDR, FPR, and FNR metrics demonstrate the proposed approach's high prediction reliability.

To evaluate the proposed approach's effectiveness, a comparative analysis with the results of recent related state-of-the-art approaches, using the same BreakHis dataset, is presented in Table 2. Among the various metrics [45] that can be considered for evaluating classification models, accuracy is the most frequently used in the related state-of-the-art approaches. A related point to consider is that even a model with high accuracy may not reliably identify the actual cancer patients, leading to severe consequences, especially if there is a significant disparity between the numbers of positive and negative labels in BreakHis.
According to Table 2, the results show significant improvements by the proposed approach and outstanding reliability from a precision perspective against the competing state-of-the-art approaches, evaluated on the same benchmark dataset, i.e., BreakHis. It is worth mentioning that some of these approaches present their results only from the accuracy perspective, which cannot reveal the model's reliability. Hence, approaches such as those in [8]-[10], [12], [27], and [28] cannot be verified as reliable BC detection systems.

Conclusion
This paper presents a hybrid ensemble deep learning approach for reliable breast cancer detection. The presented approach combines the pre-trained Xception model and a two-layer stacked LSTM model for enhanced extracted features, upon which an SVM classifier performs breast cancer detection. The BreakHis dataset is used in the training and testing phases, with nine different data augmentation techniques applied to boost the performance and reliability of the proposed approach. Experimental results demonstrate that incorporating the regression-based LSTM branch improves the fine-tuned Xception-based model in accuracy, precision, FDR, FPR, and FNR by 10.65%, 11.6%, 48.26%, 49.38%, and 45.04%, respectively. Additional improvements of 3.43%, 5.22%, 46.88%, 48.78%, and 13.89% for the same metrics are provided by applying the SVM classifier to the merged extracted features. Comparative results between the proposed approach and a list of recent state-of-the-art approaches show that the proposed approach significantly outperforms them, with values of 94% and 95% for the accuracy and precision metrics, respectively, which demonstrates its high reliability in BC detection. As future work, the proposed approach can be applied to detecting different cancer types in histopathology images and even to the detection of COVID-19 in X-rays. Moreover, sequence-based DL models, such as attention-based ones, can be implemented in a new hybrid DL model and compared with the proposed one.

Acknowledgment
The authors thank the anonymous reviewers for their valuable support and effort, which greatly improved the manuscript quality. The authors also thank the Department of Computer Engineering and Artificial Intelligence at the Military Technical College (Cairo, Egypt) for their appreciated encouragement.

Declarations
Author contribution. All authors contributed equally as main contributors to this paper. All authors read and approved the final paper.
Funding statement. None of the authors have received any funding or grants from any institution or funding body for the research.