Modified balanced random forest for improving imbalanced data prediction

Zahra Putri Agusta; Adiwijaya Adiwijaya

doi:10.26555/ijain.v5i1.255


(1)* Zahra Putri Agusta (Surya University, Indonesia)
(2) Adiwijaya Adiwijaya (Telkom University, Indonesia)
* Corresponding author

Abstract

This paper proposes a Modified Balanced Random Forest (MBRF) algorithm as a classification technique for imbalanced data. MBRF modifies the Balanced Random Forest by applying a clustering-based under-sampling strategy to the bootstrap sample of each decision tree in the Random Forest algorithm. To find the best-performing configuration of the proposed method, we compared four clustering techniques: K-Means, Spectral Clustering, Agglomerative Clustering, and Ward Hierarchical Clustering. The experimental results show that Ward Hierarchical Clustering achieved the best performance, and that the proposed MBRF method outperformed the Balanced Random Forest (BRF) and Random Forest (RF) algorithms, with a sensitivity or true positive rate (TPR) of 93.42%, a specificity or true negative rate (TNR) of 93.60%, and the best AUC value of 93.51%. MBRF also reduced running time.
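
The idea of clustering-based under-sampling per bootstrap can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how clusters are turned into a reduced majority set, so the code below assumes that the majority rows of each bootstrap are grouped into as many Ward clusters as there are minority rows, and that one row per cluster (the one nearest the cluster mean) is kept. The function names `cluster_undersample`, `fit_mbrf_sketch`, `predict_mbrf_sketch`, and `tpr_tnr` are hypothetical; the clustering and tree classes come from scikit-learn.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier


def cluster_undersample(X_maj, n_keep):
    """Reduce majority rows to n_keep rows, one per Ward cluster."""
    if len(X_maj) <= n_keep:
        return X_maj
    labels = AgglomerativeClustering(
        n_clusters=n_keep, linkage="ward"
    ).fit_predict(X_maj)
    kept = []
    for c in range(n_keep):
        members = X_maj[labels == c]
        center = members.mean(axis=0)
        # Assumption: the representative is the member nearest the cluster mean.
        nearest = np.argmin(np.linalg.norm(members - center, axis=1))
        kept.append(members[nearest])
    return np.asarray(kept)


def fit_mbrf_sketch(X, y, n_trees=10, seed=0):
    """Train trees on bootstraps balanced by cluster-based under-sampling.

    Assumes binary labels with 1 = minority class and 0 = majority class.
    """
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap with replacement
        Xb, yb = X[idx], y[idx]
        X_min, X_maj = Xb[yb == 1], Xb[yb == 0]
        if len(X_min) == 0 or len(X_maj) == 0:
            continue  # skip a degenerate bootstrap that lost one class
        X_maj = cluster_undersample(X_maj, len(X_min))
        X_bal = np.vstack([X_min, X_maj])
        y_bal = np.r_[np.ones(len(X_min), dtype=int),
                      np.zeros(len(X_maj), dtype=int)]
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(1 << 31))
        )
        trees.append(tree.fit(X_bal, y_bal))
    return trees


def predict_mbrf_sketch(trees, X):
    """Majority vote over the trees."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)


def tpr_tnr(y_true, y_pred):
    """Sensitivity (TPR) and specificity (TNR) for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```

Swapping `AgglomerativeClustering` for another scikit-learn clusterer such as `KMeans` or `SpectralClustering` gives the other variants compared in the abstract. For hard class predictions the AUC reduces to the mean of TPR and TNR, and the reported figures are consistent with that: (93.42 + 93.60) / 2 = 93.51.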

Keywords

Imbalanced data; Random forest algorithm; Balanced random forest; Customer churn; Classification technique

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)