Feature selection to increase the random forest method performance on high dimensional data

Maria Irmina Prasetiyowati; Nur Ulfa Maulidevi; Kridanto Surendro

doi:10.26555/ijain.v6i3.471


⁽¹⁾ * Maria Irmina Prasetiyowati

(Institut Teknologi Bandung, Indonesia)
⁽²⁾ Nur Ulfa Maulidevi

(Institut Teknologi Bandung, Indonesia)
⁽³⁾ Kridanto Surendro

(Institut Teknologi Bandung, Indonesia)
* corresponding author

Abstract

Random Forest is a supervised classification method based on Breiman's bagging (bootstrap aggregating) and random selection of features. Because the features given to the Random Forest are chosen at random, the selected features are not necessarily informative, so feature selection is needed before Random Forest is applied. The purpose of this feature selection is to find an optimal subset of features that carry valuable information, in the hope of speeding up the Random Forest method, particularly on high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover datasets. Feature selection was performed with the Correlation-Based Feature Selection method using the BestFirst search method. Tests were carried out 30 times, using 10-fold cross-validation and a split of the dataset into 70% training and 30% testing. On the Parkinson dataset, the experiments ran 0.27 and 0.28 seconds faster than Random Forest without feature selection. Likewise, the Urban Land Cover experiments were 0.04 and 0.03 seconds faster, and the CNAE-9 experiments were 2.23 and 2.81 seconds faster than Random Forest without feature selection. These experiments show that Random Forest runs faster when feature selection is performed first. Accuracy also increased on the Parkinson and Urban Land Cover datasets, while only the CNAE-9 experiment obtained a lower accuracy. The benefit of this research is showing that performing feature selection first with the Correlation-Based Feature Selection method can increase the speed and accuracy of the Random Forest method on high-dimensional data.
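The idea behind Correlation-Based Feature Selection can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation. It scores a subset of k features with the standard CFS merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. Two simplifying assumptions are made here: absolute Pearson correlation stands in for the correlation measure, and a plain greedy forward search stands in for BestFirst. The function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, target):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(columns[j], target)) for j in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all distinct pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, target):
    """Greedy forward search (an assumption; simpler than best-first search)."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        scored = [(cfs_merit(selected + [j], columns, target), j) for j in remaining]
        score, j = max(scored, key=lambda t: t[0])
        if score <= best:
            break  # no candidate improves the merit, so stop
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Hypothetical toy data: one informative feature, one redundant, one uninformative.
TARGET = [0, 0, 0, 0, 1, 1, 1, 1]
COLUMNS = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 0: informative
    [0, 0, 1, 1, 1, 1, 1, 1],  # feature 1: weaker and correlated with feature 0
    [1, 0, 1, 0, 1, 0, 1, 0],  # feature 2: uncorrelated with the class
]

selected, best = greedy_cfs(COLUMNS, TARGET)
print(selected, round(best, 4))  # prints: [0] 0.7746
```

On this toy data only feature 0 is kept: adding the redundant or the uninformative feature lowers the merit, which is the behaviour that lets a classifier such as Random Forest train on fewer columns.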

Keywords

Random forest; Feature selection; BestFirst method; High dimensional data; CNAE-9 dataset

DOI

https://doi.org/10.26555/ijain.v6i3.471

References

[1] C. Hu, Y. Chen, L. Hu, and X. Peng, “A novel random forests based class incremental learning method for activity recognition,” Pattern Recognit., vol. 78, pp. 277–290, Jun. 2018, doi: 10.1016/j.patcog.2018.01.025.

[2] E. Scornet, G. Biau, and J.-P. Vert, “Consistency of random forests,” Ann. Stat., vol. 43, no. 4, pp. 1716–1741, Aug. 2015, doi: 10.1214/15-AOS1321.

[3] D. Talreja, J. Nagaraj, N. J. Varsha, and K. Mahesh, “Terrorism analytics: Learning to predict the perpetrator,” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1723–1726, doi: 10.1109/ICACCI.2017.8126092.

[4] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.

[5] Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, “Stratified sampling for feature subspace selection in random forests for high dimensional data,” Pattern Recognit., vol. 46, no. 3, pp. 769–787, Mar. 2013, doi: 10.1016/j.patcog.2012.09.005.

[6] J. Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new perspective,” Neurocomputing, vol. 300, pp. 70–79, Jul. 2018, doi: 10.1016/j.neucom.2017.11.077.

[7] P. Martín-Smith, J. Ortega, J. Asensio-Cubero, J. Q. Gan, and A. Ortiz, “A supervised filter method for multi-objective feature selection in EEG classification based on multi-resolution analysis for BCI,” Neurocomputing, vol. 250, pp. 45–56, Aug. 2017, doi: 10.1016/j.neucom.2016.09.123.

[8] H. Zhou, Y. Zhang, Y. Zhang, and H. Liu, “Feature selection based on conditional mutual information: minimum conditional relevance and minimum conditional redundancy,” Appl. Intell., vol. 49, no. 3, pp. 883–896, Mar. 2019, doi: 10.1007/s10489-018-1305-0.

[9] J. Wang, J.-M. Wei, Z. Yang, and S.-Q. Wang, “Feature Selection by Maximizing Independent Classification Information,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 4, pp. 828–841, Apr. 2017, doi: 10.1109/TKDE.2017.2650906.

[10] P. Zhu, Q. Xu, Q. Hu, and C. Zhang, “Co-regularized unsupervised feature selection,” Neurocomputing, vol. 275, pp. 2855–2863, Jan. 2018, doi: 10.1016/j.neucom.2017.11.061.

[11] A. Zakeri and A. Hokmabadi, “Efficient feature selection method using real-valued grasshopper optimization algorithm,” Expert Syst. Appl., vol. 119, pp. 61–72, Apr. 2019, doi: 10.1016/j.eswa.2018.10.021.

[12] K.-C. Lin, J. C. Hung, and J. Wei, “Feature selection with modified lion’s algorithms and support vector machine for high-dimensional data,” Appl. Soft Comput., vol. 68, pp. 669–676, Jul. 2018, doi: 10.1016/j.asoc.2018.01.011.

[13] Z. Manbari, F. AkhlaghianTab, and C. Salavati, “Hybrid fast unsupervised feature selection for high-dimensional data,” Expert Syst. Appl., vol. 124, pp. 97–118, Jun. 2019, doi: 10.1016/j.eswa.2019.01.016.

[14] M. Moran and G. Gordon, “Curious Feature Selection,” Inf. Sci. (Ny)., vol. 485, pp. 42–54, Jun. 2019, doi: 10.1016/j.ins.2019.02.009.

[15] P. Drotár, M. Gazda, and L. Vokorokos, “Ensemble feature selection using election methods and ranker clustering,” Inf. Sci. (Ny)., vol. 480, pp. 365–380, Apr. 2019, doi: 10.1016/j.ins.2018.12.033.

[16] D. Panday, R. Cordeiro de Amorim, and P. Lane, “Feature weighting as a tool for unsupervised feature selection,” Inf. Process. Lett., vol. 129, pp. 44–52, Jan. 2018, doi: 10.1016/j.ipl.2017.09.005.

[17] H. Dong, T. Li, R. Ding, and J. Sun, “A novel hybrid genetic algorithm with granular information for feature selection and optimization,” Appl. Soft Comput., vol. 65, pp. 33–46, Apr. 2018, doi: 10.1016/j.asoc.2017.12.048.

[18] J. L. Speiser, M. E. Miller, J. Tooze, and E. Ip, “A comparison of random forest variable selection methods for classification prediction modeling,” Expert Syst. Appl., vol. 134, pp. 93–101, Nov. 2019, doi: 10.1016/j.eswa.2019.05.028.

[19] F. Degenhardt, S. Seifert, and S. Szymczak, “Evaluation of variable selection methods for random forests and omics data sets,” Brief. Bioinform., vol. 20, no. 2, pp. 492–503, Mar. 2019, doi: 10.1093/bib/bbx124.

[20] D. Amaratunga, J. Cabrera, and Y.-S. Lee, “Enriched random forests,” Bioinformatics, vol. 24, no. 18, pp. 2010–2014, Sep. 2008, doi: 10.1093/bioinformatics/btn356.

[21] M. Lu, “Embedded feature selection accounting for unknown data heterogeneity,” Expert Syst. Appl., vol. 119, pp. 350–361, Apr. 2019, doi: 10.1016/j.eswa.2018.11.006.

[22] C. O. Sakar et al., “A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform,” Appl. Soft Comput., vol. 74, pp. 255–263, Jan. 2019, doi: 10.1016/j.asoc.2018.10.022.

[23] B. Johnson and Z. Xie, “Classifying a high resolution image of an urban area using super-object information,” ISPRS J. Photogramm. Remote Sens., vol. 83, pp. 40–49, Sep. 2013, doi: 10.1016/j.isprsjprs.2013.05.008.

[24] B. A. Johnson, “High-resolution urban land-cover classification using a competitive multi-scale object-based approach,” Remote Sens. Lett., vol. 4, no. 2, pp. 131–140, Feb. 2013, doi: 10.1080/2150704X.2012.705440.

[25] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, vol. 103. New York, NY: Springer New York, 2013, doi: 10.1007/978-1-4614-7138-7.

[26] M. L. Bermingham et al., “Application of high-dimensional feature selection: evaluation for genomic prediction in man,” Sci. Rep., vol. 5, no. 1, p. 10312, Sep. 2015, doi: 10.1038/srep10312.

[27] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 856–863.

[28] J. Wang, Z. Feng, N. Lu, and J. Luo, “Toward optimal feature and time segment selection by divergence method for EEG signals classification,” Comput. Biol. Med., vol. 97, pp. 161–170, Jun. 2018, doi: 10.1016/j.compbiomed.2018.04.022.

[29] D. Bansal, R. Chhikara, K. Khanna, and P. Gupta, “Comparative Analysis of Various Machine Learning Algorithms for Detecting Dementia,” Procedia Comput. Sci., vol. 132, pp. 1497–1502, 2018, doi: 10.1016/j.procs.2018.05.102.

[30] X. Li, H. Wang, B. Gu, and C. X. Ling, “The convergence of linear classifiers on large sparse data,” Neurocomputing, vol. 273, pp. 622–633, Jan. 2018, doi: 10.1016/j.neucom.2017.08.045.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
