Modified balanced random forest for improving imbalanced data prediction

(1) * Zahra Putri Agusta (Surya University, Indonesia)
(2) Adiwijaya Adiwijaya (Telkom University, Indonesia)
*corresponding author

Abstract


This paper proposes a Modified Balanced Random Forest (MBRF) algorithm as a classification technique for imbalanced data. MBRF modifies the Balanced Random Forest by applying a clustering-based under-sampling strategy to the bootstrap sample of each decision tree in the Random Forest algorithm. To find the best-performing configuration of the proposed method, we compared four clustering techniques: K-Means, Spectral Clustering, Agglomerative Clustering, and Ward Hierarchical Clustering. The experimental results show that Ward Hierarchical Clustering achieved the best performance and that the proposed MBRF method outperformed the Balanced Random Forest (BRF) and Random Forest (RF) algorithms, with a sensitivity, or true positive rate (TPR), of 93.42%, a specificity, or true negative rate (TNR), of 93.60%, and the best AUC value of 93.51%. MBRF also reduced the running time.
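
The abstract does not spell out how the clusters drive the under-sampling, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It draws a bootstrap sample, groups the majority-class rows with a naive Ward-linkage agglomerative clustering, and keeps from each cluster a share proportional to its size so that the majority class shrinks to roughly the minority count. The function names (`ward_cost`, `ward_clusters`, `balanced_bootstrap`), the label convention (1 = minority, 0 = majority), the proportional quota rule, and the toy data are all assumptions made for illustration. Fitting a decision tree on each resulting sample is omitted.

```python
import random

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (na, ca), (nb, cb) = a, b
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist

def ward_clusters(points, k):
    """Naive agglomerative clustering with Ward linkage.

    Returns a list of clusters, each a list of indices into `points`.
    Cubic-time search; intended for small illustrative inputs only.
    """
    members = [[i] for i in range(len(points))]
    stats = [(1, list(p)) for p in points]  # (cluster size, centroid)
    while len(members) > max(k, 1):
        pairs = ((i, j) for i in range(len(members))
                 for j in range(i + 1, len(members)))
        i, j = min(pairs, key=lambda ij: ward_cost(stats[ij[0]], stats[ij[1]]))
        (ni, ci), (nj, cj) = stats[i], stats[j]
        centroid = [(ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj)]
        members[i] += members[j]
        stats[i] = (ni + nj, centroid)
        del members[j], stats[j]  # j > i, so index i is unaffected
    return members

def balanced_bootstrap(X, y, n_clusters, rng):
    """Bootstrap the data, then under-sample the majority class per cluster.

    Assumed convention (hypothetical): label 1 is the minority class,
    label 0 the majority class. Returns row indices into X.
    """
    n = len(X)
    boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    minority = [i for i in boot if y[i] == 1]
    majority = [i for i in boot if y[i] == 0]
    clusters = ward_clusters([X[i] for i in majority], n_clusters)
    kept = []
    for cluster in clusters:
        # Assumed quota rule: each cluster contributes in proportion to its size.
        quota = max(1, round(len(minority) * len(cluster) / len(majority)))
        picks = rng.sample(cluster, min(quota, len(cluster)))
        kept += [majority[p] for p in picks]
    return minority + kept

# Toy data: two majority blobs (40 rows) and one minority blob (8 rows).
rng = random.Random(0)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
     + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(20)]
     + [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(8)])
y = [0] * 40 + [1] * 8
sample = balanced_bootstrap(X, y, n_clusters=4, rng=rng)
```

Sampling per cluster rather than uniformly at random is meant to keep every region of the majority class represented after under-sampling. In this sketch the number of retained majority rows is only approximately equal to the minority count, because each cluster's quota is rounded and never drops below one.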

Keywords


Imbalanced data; Random forest algorithm; Balanced random forest; Customer churn; Classification technique

   

DOI

https://doi.org/10.26555/ijain.v5i1.255
      





Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
   andri.pranolo.id@ieee.org (publication issues)
