Feature selection to increase the performance of the Random Forest method on high-dimensional data

Abstract
Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because the features assigned to a Random Forest are chosen randomly, the selected features are not necessarily informative, so feature selection is needed. The purpose of this feature selection is to obtain an optimal subset of features containing valuable information, in the hope of speeding up the Random Forest method, particularly on high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. Feature selection is done with the Correlation-Based Feature Selection method, using the BestFirst search. Tests were carried out 30 times using K-fold cross-validation with K = 10 and by splitting the dataset into 70% training and 30% testing. The experiments on the Parkinson dataset were 0.27 and 0.28 seconds faster than the Random Forest method without feature selection; likewise, the trials on the Urban Land Cover dataset were 0.04 and 0.03 seconds faster, and on the CNAE-9 dataset 2.23 and 2.81 seconds faster. These experiments show that the Random Forest process is faster when feature selection is performed first. The accuracy also increased in the first two experiments, while only the CNAE-9 experiment yielded lower accuracy. The benefit of this research is that performing feature selection first, using the Correlation-Based Feature Selection method, can increase the speed and accuracy of the Random Forest method on high-dimensional data.


Introduction
Random Forest is an ensemble learning method [1] for classification and regression that builds a number of decision trees in a forest and predicts the result by voting [2]. The algorithm builds each decision tree without pruning [3], and for classification it uses a combination of Breiman's "bagging" and random feature selection [4]. Random Forest classification is done by combining trees, each trained on a bootstrap sample of the data. The features used to build each tree are selected randomly at the start of the algorithm, and impurity measures are used as the criterion to determine the best feature for partitioning a node. From this random selection of features, a decision tree is formed. Trees are built as many times as desired, and the number of trees in the forest affects the obtained accuracy: a larger forest generally yields higher accuracy. From the decision trees that have been built, Random Forest classification is done by voting, and the winning class is the one that receives the most votes from the trees. The performance of Random Forest depends on the diversity of the forest's decision trees and the performance of each decision tree [5]. Breiman formulates the overall performance of a set of trees as the average strength and average correlation between trees [4], and shows that the generalization error of a Random Forest classifier is bounded by the ratio of the average correlation between trees divided by the square of the average tree strength [5].
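As a concrete illustration of the voting step just described, the following is a minimal Java sketch: each trained tree casts one vote for a class, and the forest predicts the class with the most votes. The Tree interface and its classify() method are hypothetical placeholders, not part of any particular library.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for whatever decision-tree implementation
// the forest was built with.
interface Tree {
    int classify(double[] instance); // returns a class index
}

class ForestVoting {
    // Predict by majority vote over all trees in the forest.
    static int predictByVoting(List<Tree> forest, double[] instance) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Tree t : forest) {
            int c = t.classify(instance);    // each tree votes once
            votes.merge(c, 1, Integer::sum); // tally the vote
        }
        // The winning class is the one with the most votes
        // (assumes a non-empty forest).
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
```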
Because feature selection in Random Forest is done randomly and repeatedly, the computing process can take a long time. It is also possible that the randomly chosen features are not informative, especially with high-dimensional data. There are several studies on feature selection [6]–[17], and especially on feature selection in Random Forest [18], [19], such as the research conducted by Amaratungga [20]. That study used individual feature weighting and proved to improve classification performance, but features with large weights may be chosen repeatedly. In contrast, Ye's research groups features into two groups: strongly informative features and weakly informative features [5]. That study shows better performance than other algorithms such as SVM, four variants of Random Forest, Nearest Neighbor (NN), and Naïve Bayes (NB). Both Random Forest feature selection studies aim to improve performance and accuracy. Another feature selection study was conducted by Manbari et al. [13], who presented a new hybrid filter-based feature selection algorithm combining modified clustering with a Binary Ant System (BAS). The proposed model provides global and local search capabilities between and within clusters, achieves better performance than other feature selection methods, and reduces computational complexity; its disadvantage is that it greatly reduces the number of clusters and the selectivity of features. Lu [21] used the embedded approach and proposed the Sparse Optimal Scoring with Adjustment (SOSA) method. Experimental results on synthetic data and three datasets show that the features selected by SOSA consistently produce better or comparable classification performance compared with features chosen by traditional embedded methods. Moran and Gordon [14] also proposed a feature selection method called Curious Feature Selection (CFS) that achieves the same accuracy as the simple and greedy Sequential Forward Selection algorithm, with advantages in handling overfitting, online learning, and scalability. Although the study of Manbari et al. [13] performed quite well in terms of time and accuracy, the process used is quite complicated, while Moran and Gordon [14] had a positive impact on the accuracy problem.
In this study, we focus on improving execution time and accuracy by applying Correlation-Based Feature Selection with the best-first search to high-dimensional datasets. The results are then compared with those of the Random Forest method without feature selection: the speed and accuracy of the original Random Forest (without feature selection) are tested against a Random Forest that has undergone feature selection first. The datasets used in the tests are UCI's high-dimensional datasets, i.e., the Parkinson, CNAE-9, and Urban Land Cover datasets. This paper is structured as follows: the introduction is outlined in Section 1; the method is explained in Section 2; Section 3 describes the results of the experiments; and Section 4 is the conclusion.

Method
The research used UCI datasets and analyzed them using the Weka tools software version 3.9.2. The datasets used are the high-dimensional CNAE-9, Parkinson, and Urban Land Cover datasets; the Parkinson dataset has been used by Sakar et al. [22], while the Urban Land Cover dataset has been used by Johnson and Xie [23] and Johnson [24]. The test was conducted in 30 repetitions, using K-fold cross-validation with K = 10 and a 70% training / 30% testing split. Cross-fold validation is a technique that allows every part of the dataset to serve as both training data and test data; Weka's default is K = 10, meaning the data is partitioned into 10 randomized folds for validation. Each test is done by changing the seed value in Weka, which controls its random data generation. Fifteen seeds are entered sequentially, from 1 to 15, while the other 15 seeds are entered randomly.
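As an illustration, the following is a minimal sketch of this evaluation protocol using Weka's Java API; the ARFF path and the simple 1–15 seed loop are placeholders for the actual files and the full set of 30 seed values described above.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationRuns {
    public static void main(String[] args) throws Exception {
        // Load a dataset (placeholder path); the last attribute is the class.
        Instances data = DataSource.read("parkinson.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Repeat the evaluation with different seeds, as in the protocol above.
        for (int seed = 1; seed <= 15; seed++) {
            RandomForest rf = new RandomForest();
            Evaluation eval = new Evaluation(data);
            long start = System.nanoTime();
            eval.crossValidateModel(rf, data, 10, new Random(seed)); // K = 10
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("seed %d: accuracy %.2f%%, %.2f s%n",
                    seed, eval.pctCorrect(), seconds);
        }
    }
}
```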
The feature selection used is attribute selection with the Correlation-based feature selector (CfsSubsetEval) attribute evaluator and the BestFirst search method. CfsSubsetEval evaluates the worth of a subset of attributes by considering each feature's individual predictive ability along with the degree of redundancy between features [29]. The BestFirst method is a search algorithm based on optimizing the best value. The result of feature selection is applied to the dataset, after which the dataset is analyzed using Random Forest. More details can be seen in Fig. 1 and Fig. 2.
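A minimal sketch of this selection step with Weka's Java API is shown below (the dataset path is a placeholder). The AttributeSelection filter wraps CfsSubsetEval and BestFirst and returns a copy of the data restricted to the selected feature subset, which can then be passed to a RandomForest classifier as in the previous sketch.

```java
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class CfsPipeline {
    public static Instances selectFeatures(Instances data) throws Exception {
        // Correlation-based subset evaluator with best-first search,
        // matching the CfsSubsetEval + BestFirst setup used in this study.
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());
        filter.setInputFormat(data);
        // Returns a copy of the dataset keeping only the selected subset.
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("cnae9.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        Instances reduced = selectFeatures(data);
        System.out.println("Attributes kept: " + reduced.numAttributes());
    }
}
```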

Experiment Results
The study was conducted by testing each high-dimensional dataset 30 times using different seed values available in the Weka tools software version 3.9.2. The trials use K-fold cross-validation with K = 10 and split the datasets into 70% training and 30% testing, as sketched below.
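The 70%/30% percentage split can be reproduced with Weka's Java API roughly as follows; this is a sketch, and the path and seed are placeholders.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("urban_land_cover.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle with a given seed, then split 70% training / 30% testing.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.70);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                data.numInstances() - trainSize);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(rf, test); // evaluate on the held-out 30%
        System.out.printf("accuracy: %.2f%%%n", eval.pctCorrect());
    }
}
```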

First experiment
In the first series of experiments, UCI's Parkinson dataset was used, which consists of 756 instances collected from 188 patients (107 men and 81 women) aged 33 to 87 years. This dataset is high-dimensional, with 755 attributes: 754 features and 1 class attribute. The classification carried out by the Random Forest method uses the 754 randomly selected features, with 30 different seed values. The first experiment using K-fold cross-validation produced an average accuracy of 86.66% with an average speed of 0.48 seconds (Table 1); with the 70% split, the average accuracy is 85.17% with an average speed of 0.47 seconds (Table 2). With feature selection using the Correlation-based feature selector (CfsSubsetEval) attribute evaluator and the BestFirst method, the Random Forest with K-fold cross-validation reaches an average accuracy of 88.46% at an average speed of 0.20 seconds (Table 1), and with the 70% split an accuracy of 86.77% at an average speed of 0.20 seconds (Table 2). Both the accuracy and the average speed of the Random Forest method with prior feature selection are better: it is faster and more accurate.

Second experiment
The second series of experiments used UCI's CNAE-9 dataset, which contains 1080 documents of free-text business descriptions from privatized Brazilian companies. This dataset is high-dimensional, with 857 attributes: 1 class attribute and 856 word-frequency attributes in integer form. The Random Forest method classifies using the 856 randomly selected features. The experiments with K-fold cross-validation produced a 93.72% average accuracy and a 2.49-second average speed (Table 3), while the 70% split produced a 94.20% average accuracy and a 3.08-second average speed (Table 4). The Random Forest method with K-fold cross-validation and feature selection obtained an average accuracy of 81.23% and an average speed of 0.26 seconds (Table 3); with the 70% split, the average accuracy is 81.75% with an average speed of 0.27 seconds (Table 4).

Third experiment
Urban Land Cover data taken from the UCI Urban training data was used in the third set of experiments. The Urban Land Cover data contains 168 training instances for high-resolution urban land-cover classification. This dataset has 148 attributes: 1 class attribute and 147 numeric feature attributes. The classification was done by the Random Forest method using the 147 randomly selected features. K-fold cross-validation produced an average accuracy of 85.08% with an average speed of 0.10 seconds (Table 5); with the 70% split, the average accuracy is 82.13% with an average speed of 0.08 seconds (Table 6). The Random Forest method with K-fold cross-validation and feature selection obtained an average accuracy of 87.52% with an average speed of 0.06 seconds (Table 5); with the 70% split, the average accuracy is 87.27% with an average speed of 0.05 seconds (Table 6).

Discussion
On the CNAE-9 dataset, the original Random Forest method obtained a higher average accuracy than the Random Forest with prior feature selection (Fig. 3). However, in every experiment, the average speed of the Random Forest that uses feature selection first is much faster than that of the original Random Forest (Fig. 4).
From these experiments, it can be seen that performing feature selection affects the processing speed of the Random Forest method. The first experiment, on the Parkinson dataset with K-fold cross-validation, shows that the average time required with prior feature selection is 0.20 seconds, whereas without feature selection it takes 0.48 seconds, a difference of 0.28 seconds. Besides the speed improvement, the accuracy increased by 1.8%, from 86.66% (without feature selection) to 88.46% (with feature selection). The same dataset was also tested using a 70% training / 30% testing split: the average time required dropped from 0.47 seconds to 0.20 seconds, and the accuracy increased from 85.17% to 86.77%. That is a decrease of 0.27 seconds in the average time needed and an increase of 1.6% in average accuracy.

The second experiment used the CNAE-9 data. With K-fold cross-validation, there was a difference of 2.23 seconds between the average time required by Random Forest without feature selection and Random Forest with feature selection: the original Random Forest needs 2.49 seconds on average, while Random Forest with feature selection requires only 0.26 seconds. For accuracy, however, Random Forest without feature selection produces a much higher average accuracy, 93.72%, which is 12.49% higher than Random Forest with feature selection. Likewise, in the trials using the 70%/30% split, the average accuracy of Random Forest without feature selection is superior by 12.45%, while the Random Forest with feature selection is 2.81 seconds faster.
The third experiment was carried out on the Urban Land Cover data. With K-fold cross-validation, the average time required with prior feature selection was 0.06 seconds, whereas without feature selection it takes 0.10 seconds, a difference of 0.04 seconds. The average accuracy also increased by 2.44%, from 85.08% without feature selection to 87.52% with feature selection, so the method is both faster and more accurate. The same dataset was tested using the 70% training / 30% testing split: the average time needed dropped from 0.08 seconds to 0.05 seconds, and the accuracy increased from 82.13% to 87.27%. That is a decrease of 0.03 seconds in the average time required and an increase of 5.14% in average accuracy.
The trials conducted on high-dimensional data with Random Forest plus feature selection, using both K-fold cross-validation and the 70%/30% split, increased execution speed. However, the average accuracy on the CNAE-9 dataset decreased by 12.49% and 12.45% compared with Random Forest without feature selection. This may have happened because there was too much irrelevant or sparse data in the CNAE-9 dataset [30]. Future feature selection research therefore needs a method or algorithm whose average accuracy increases, or at least matches, the accuracy of Random Forest without feature selection.

Conclusion
This research showed that feature selection with the Correlation-based feature selector (CfsSubsetEval) and the BestFirst method can speed up the Random Forest method's classification time and improve its accuracy. This is evident from the tests completed on the high-dimensional Parkinson, CNAE-9, and Urban Land Cover datasets: the average execution speed improves by between 0.27 and 2.81 seconds. In addition to the increased average speed, the Random Forest method's average accuracy with feature selection also increases when tested on the Parkinson and Urban Land Cover datasets. However, when tested on the CNAE-9 data, the average accuracy dropped, which might be due to a sparsity problem. Further experimental development may seek a new feature selection method or algorithm that provides both increased speed and more accurate results (sensitivity and specificity) for use with the Random Forest method.