Boosting and bagging classification for computer science journal

(1) * Nastiti Susetyo Fanany Putri (Universitas Negeri Malang, Indonesia)
(2) Aji Prasetya Wibawa (Universitas Negeri Malang, Indonesia)
(3) Harits Ar Rasyid (Universitas Negeri Malang, Indonesia)
(4) Andrew Nafalski (University of South Australia, Australia)
(5) Ummi Rabaah Hasyim (Universiti Teknikal Malaysia Melaka, Melaka, Malaysia)
*corresponding author

Abstract


In recent years, data processing has become an issue across all disciplines. Good data processing can provide decision-making recommendations. Data processing is covered in academic publications, including those in computer science. The topic has grown over the past three years, which shows that data processing is expanding and diversifying and that interest in this area of study is high. Journals are grouped into quartiles, which indicate a journal's influence on similar studies. SCImago provides this categorization. There are four quartiles, with quartile 1 the highest and quartile 4 the lowest. However, quartile assignments are often inconsistent: the same journal can have different quartile values in different disciplines. A classification method is therefore proposed to address this issue. Classification is a machine-learning technique that groups data according to supplied class labels. This study used Boosting and Bagging ensembles with Decision Tree (DT) and Gaussian Naive Bayes (GNB) classifiers. The depth and estimator settings of the ensemble algorithms were varied to examine how increasing these values affects the resulting precision. For the DT algorithm both settings were varied, whereas for the GNB algorithm only the estimator value was varied. Based on the average accuracy results, the best algorithm for the computer science dataset is GNB Bagging, with values of 68.96%, 70.99%, and 69.05%. XGBDT is second, with 67.75% accuracy, 67.69% precision, and 67.83% recall. DT Bagging is third, with 67.30% accuracy, 68.13% precision, and 67.31% recall. XGBoost GNB is fourth, with 67.07% accuracy, 68.85% precision, and 67.18% recall. AdaBoost DT is fifth, with 63.65% accuracy, 64.21% precision, and 63.63% recall. AdaBoost GNB performs worst on this dataset, achieving only 43.19% accuracy, 48.14% precision, and 43.2% recall. These results are still far from ideal, so the proposed method is not recommended for resolving journal quartile inconsistencies.
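To make the bagging idea in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a depth-1 decision tree (a stump) as the base classifier, and the helper names (fit_stump, fit_bagging, predict_bagging), the single numeric feature, and the two-label toy data are all hypothetical. Each estimator is trained on a bootstrap sample drawn with replacement, and the ensemble predicts by majority vote; n_estimators plays the role of the estimator setting that the study varies.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a depth-1 decision tree: the (feature, threshold) split with fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def fit_bagging(X, y, n_estimators, seed=0):
    """Train n_estimators stumps, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bagging(models, row):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one made-up numeric indicator per journal and two
# quartile labels only, chosen so that a single threshold separates them.
X_toy = [[0.1], [0.2], [0.3], [0.4], [2.0], [2.2], [2.5], [3.0]]
y_toy = ["Q4"] * 4 + ["Q1"] * 4

models = fit_bagging(X_toy, y_toy, n_estimators=25)
print(predict_bagging(models, [0.05]), predict_bagging(models, [5.0]))
```

An odd number of estimators avoids tied votes in the two-label case. A boosting ensemble differs in that estimators are trained sequentially, each one giving more weight to the samples its predecessors misclassified, rather than independently on bootstrap samples.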

Keywords


Ensemble Learning, Boosting, Bagging, Decision Tree, Gaussian Naive Bayes, SCImago Journal Rank

   

DOI

https://doi.org/10.26555/ijain.v9i1.985
      






Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
   andri.pranolo.id@ieee.org (publication issues)
