Feature selection to increase the random forest method performance on high dimensional data

(1) * Maria Irmina Prasetiyowati (Institut Teknologi Bandung, Indonesia)
(2) Nur Ulfa Maulidevi (Institut Teknologi Bandung, Indonesia)
(3) Kridanto Surendro (Institut Teknologi Bandung, Indonesia)
*corresponding author


Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because the features supplied to Random Forest are chosen at random, the selected features are not necessarily informative, so a feature selection step is needed. The purpose of this step is to select an optimal subset of features that contain valuable information, in the hope of speeding up the Random Forest method, particularly on high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. Feature selection was performed with the Correlation-Based Feature Selection method, using the BestFirst search method. Tests were carried out 30 times, using 10-fold cross-validation and a split of the dataset into 70% training and 30% testing data. On the Parkinson dataset, the experiments ran 0.27 and 0.28 seconds faster than the Random Forest method without feature selection. Likewise, the time differences on the Urban Land Cover dataset were 0.04 and 0.03 seconds, while on the CNAE-9 dataset the runs were 2.23 and 2.81 faster than the Random Forest method without feature selection. These experiments showed that Random Forest runs faster when feature selection is performed first. Accuracy also increased on the Parkinson and Urban Land Cover datasets; only the CNAE-9 experiment yielded lower accuracy. The benefit of this research is to show that performing feature selection first with the Correlation-Based Feature Selection method can increase both the speed and the accuracy of the Random Forest method on high-dimensional data.
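As a rough illustration of the filter step named in the abstract, the sketch below scores feature subsets with the standard Correlation-Based Feature Selection merit, merit(S) = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. It is not the authors' implementation and rests on two assumptions: Pearson correlation stands in for whatever correlation measure the paper's tooling used, and a greedy forward search stands in for BestFirst, which can also backtrack. The names `pearson`, `cfs_merit`, and `greedy_cfs` and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, target):
    """CFS merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all distinct pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, target):
    """Greedy forward search (assumption: a simplification of BestFirst, no backtracking)."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        # Pick the candidate whose addition gives the highest merit (first one on ties).
        j = max(remaining, key=lambda c: cfs_merit(selected + [c], columns, target))
        score = cfs_merit(selected + [j], columns, target)
        if score <= best:
            break  # no candidate improves the merit, so stop
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Toy data: feature 0 matches the class, feature 1 duplicates feature 0,
# feature 2 is uncorrelated with the class.
y = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
selected, merit = greedy_cfs(columns, y)
print(selected, round(merit, 3))  # [0] 1.0
```

On this toy input the redundant copy and the uninformative feature are both left out, which is the behaviour the merit function is designed to reward: high correlation with the class, low correlation among the selected features. The reduced column set would then be passed to the Random Forest classifier in place of the full feature set.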


Random forest; Feature selection; BestFirst method; High dimensional data; CNAE-9 dataset








References



[1] C. Hu, Y. Chen, L. Hu, and X. Peng, “A novel random forests based class incremental learning method for activity recognition,” Pattern Recognit., vol. 78, pp. 277–290, Jun. 2018, doi: 10.1016/j.patcog.2018.01.025.

[2] E. Scornet, G. Biau, and J.-P. Vert, “Consistency of random forests,” Ann. Stat., vol. 43, no. 4, pp. 1716–1741, Aug. 2015, doi: 10.1214/15-AOS1321.

[3] D. Talreja, J. Nagaraj, N. J. Varsha, and K. Mahesh, “Terrorism analytics: Learning to predict the perpetrator,” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1723–1726, doi: 10.1109/ICACCI.2017.8126092.

[4] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.

[5] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, “Stratified sampling for feature subspace selection in random forests for high dimensional data,” Pattern Recognit., vol. 46, no. 3, pp. 769–787, Mar. 2013, doi: 10.1016/j.patcog.2012.09.005.

[6] J. Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new perspective,” Neurocomputing, vol. 300, pp. 70–79, Jul. 2018, doi: 10.1016/j.neucom.2017.11.077.

[7] P. Martín-Smith, J. Ortega, J. Asensio-Cubero, J. Q. Gan, and A. Ortiz, “A supervised filter method for multi-objective feature selection in EEG classification based on multi-resolution analysis for BCI,” Neurocomputing, vol. 250, pp. 45–56, Aug. 2017, doi: 10.1016/j.neucom.2016.09.123.

[8] H. Zhou, Y. Zhang, Y. Zhang, and H. Liu, “Feature selection based on conditional mutual information: minimum conditional relevance and minimum conditional redundancy,” Appl. Intell., vol. 49, no. 3, pp. 883–896, Mar. 2019, doi: 10.1007/s10489-018-1305-0.

[9] J. Wang, J.-M. Wei, Z. Yang, and S.-Q. Wang, “Feature Selection by Maximizing Independent Classification Information,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 4, pp. 828–841, Apr. 2017, doi: 10.1109/TKDE.2017.2650906.

[10] P. Zhu, Q. Xu, Q. Hu, and C. Zhang, “Co-regularized unsupervised feature selection,” Neurocomputing, vol. 275, pp. 2855–2863, Jan. 2018, doi: 10.1016/j.neucom.2017.11.061.

[11] A. Zakeri and A. Hokmabadi, “Efficient feature selection method using real-valued grasshopper optimization algorithm,” Expert Syst. Appl., vol. 119, pp. 61–72, Apr. 2019, doi: 10.1016/j.eswa.2018.10.021.

[12] K.-C. Lin, J. C. Hung, and J. Wei, “Feature selection with modified lion’s algorithms and support vector machine for high-dimensional data,” Appl. Soft Comput., vol. 68, pp. 669–676, Jul. 2018, doi: 10.1016/j.asoc.2018.01.011.

[13] Z. Manbari, F. AkhlaghianTab, and C. Salavati, “Hybrid fast unsupervised feature selection for high-dimensional data,” Expert Syst. Appl., vol. 124, pp. 97–118, Jun. 2019, doi: 10.1016/j.eswa.2019.01.016.

[14] M. Moran and G. Gordon, “Curious Feature Selection,” Inf. Sci. (Ny)., vol. 485, pp. 42–54, Jun. 2019, doi: 10.1016/j.ins.2019.02.009.

[15] P. Drotár, M. Gazda, and L. Vokorokos, “Ensemble feature selection using election methods and ranker clustering,” Inf. Sci. (Ny)., vol. 480, pp. 365–380, Apr. 2019, doi: 10.1016/j.ins.2018.12.033.

[16] D. Panday, R. Cordeiro de Amorim, and P. Lane, “Feature weighting as a tool for unsupervised feature selection,” Inf. Process. Lett., vol. 129, pp. 44–52, Jan. 2018, doi: 10.1016/j.ipl.2017.09.005.

[17] H. Dong, T. Li, R. Ding, and J. Sun, “A novel hybrid genetic algorithm with granular information for feature selection and optimization,” Appl. Soft Comput., vol. 65, pp. 33–46, Apr. 2018, doi: 10.1016/j.asoc.2017.12.048.

[18] J. L. Speiser, M. E. Miller, J. Tooze, and E. Ip, “A comparison of random forest variable selection methods for classification prediction modeling,” Expert Syst. Appl., vol. 134, pp. 93–101, Nov. 2019, doi: 10.1016/j.eswa.2019.05.028.

[19] F. Degenhardt, S. Seifert, and S. Szymczak, “Evaluation of variable selection methods for random forests and omics data sets,” Brief. Bioinform., vol. 20, no. 2, pp. 492–503, Mar. 2019, doi: 10.1093/bib/bbx124.

[20] D. Amaratunga, J. Cabrera, and Y.-S. Lee, “Enriched random forests,” Bioinformatics, vol. 24, no. 18, pp. 2010–2014, Sep. 2008, doi: 10.1093/bioinformatics/btn356.

[21] M. Lu, “Embedded feature selection accounting for unknown data heterogeneity,” Expert Syst. Appl., vol. 119, pp. 350–361, Apr. 2019, doi: 10.1016/j.eswa.2018.11.006.

[22] C. O. Sakar et al., “A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform,” Appl. Soft Comput., vol. 74, pp. 255–263, Jan. 2019, doi: 10.1016/j.asoc.2018.10.022.

[23] B. Johnson and Z. Xie, “Classifying a high resolution image of an urban area using super-object information,” ISPRS J. Photogramm. Remote Sens., vol. 83, pp. 40–49, Sep. 2013, doi: 10.1016/j.isprsjprs.2013.05.008.

[24] B. A. Johnson, “High-resolution urban land-cover classification using a competitive multi-scale object-based approach,” Remote Sens. Lett., vol. 4, no. 2, pp. 131–140, Feb. 2013, doi: 10.1080/2150704X.2012.705440.

[25] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 103. New York, NY: Springer New York, 2013, doi: 10.1007/978-1-4614-7138-7.

[26] M. L. Bermingham et al., “Application of high-dimensional feature selection: evaluation for genomic prediction in man,” Sci. Rep., vol. 5, no. 1, p. 10312, Sep. 2015, doi: 10.1038/srep10312.

[27] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 856–863.

[28] J. Wang, Z. Feng, N. Lu, and J. Luo, “Toward optimal feature and time segment selection by divergence method for EEG signals classification,” Comput. Biol. Med., vol. 97, pp. 161–170, Jun. 2018, doi: 10.1016/j.compbiomed.2018.04.022.

[29] D. Bansal, R. Chhikara, K. Khanna, and P. Gupta, “Comparative Analysis of Various Machine Learning Algorithms for Detecting Dementia,” Procedia Comput. Sci., vol. 132, pp. 1497–1502, 2018, doi: 10.1016/j.procs.2018.05.102.

[30] X. Li, H. Wang, B. Gu, and C. X. Ling, “The convergence of linear classifiers on large sparse data,” Neurocomputing, vol. 273, pp. 622–633, Jan. 2018, doi: 10.1016/j.neucom.2017.08.045.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by Informatics Department - Universitas Ahmad Dahlan,  UTM Big Data Centre - Universiti Teknologi Malaysia, and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: ijain@uad.ac.id (paper handling issues)
    info@ijain.org, andri.pranolo.id@ieee.org (publication issues)
