Optimization of data resampling through GA for the classification of imbalanced datasets

Filippo Galli; Marco Vannucci; Valentina Colla

doi:10.26555/ijain.v5i3.409



(1)* Filippo Galli

(TeCIP Institute, Scuola Superiore Sant’Anna, via Moruzzi 1, Italy)
(2) Marco Vannucci

(TeCIP Institute, Scuola Superiore Sant’Anna, via Moruzzi 1, Italy)
(3) Valentina Colla

(TeCIP Institute, Scuola Superiore Sant’Anna, via Moruzzi 1, Italy)
* Corresponding author

Abstract

Classification of imbalanced datasets is a critical problem in numerous contexts. In these applications, standard methods cannot satisfactorily detect rare patterns because multiple factors bias the classifiers toward the frequent class. This paper gives an overview of a novel family of methods for resampling an imbalanced dataset in order to maximize the performance of arbitrary data-driven classifiers. The presented approaches exploit genetic algorithms (GA) to optimize the data selection process according to a set of criteria that assess each candidate sample's suitability. A comparison of the presented techniques on a set of industrial and literature datasets demonstrates the validity of this family of approaches, which not only improves the performance of a standard classifier but also determines the optimal resampling rate automatically. Future work on improving the proposed approach will include the development of new criteria for assessing sample suitability.
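To make the general idea of GA-driven resampling concrete, the following is a minimal sketch and not the authors' method. It assumes a binary problem and a toy 3-nearest-neighbour classifier. Fitness is assumed to be balanced accuracy on a validation set. The chromosome is assumed to be a bitmask marking which frequent-class samples are kept. All names (`make_toy_data`, `ga_undersample`, `fitness`) and all parameter values are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, point, k=3):
    """Majority vote among the k nearest (point, label) training pairs."""
    nearest = sorted(train, key=lambda item: sq_dist(item[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def fitness(mask, majority, minority, validation):
    """Balanced accuracy of k-NN trained on the kept majority samples plus all minority samples."""
    kept = [(p, 0) for p, keep in zip(majority, mask) if keep]
    if not kept:
        return 0.0  # a training set with a single class is treated as useless
    train = kept + [(p, 1) for p in minority]
    hits, totals = {0: 0, 1: 0}, {0: 0, 1: 0}
    for point, label in validation:
        totals[label] += 1
        hits[label] += knn_predict(train, point) == label
    return 0.5 * (hits[0] / totals[0] + hits[1] / totals[1])


def ga_undersample(majority, minority, validation, pop_size=20,
                   generations=30, mutation_rate=0.02, seed=0):
    """Evolve a keep/drop bitmask over the majority class; returns (mask, score)."""
    rng = random.Random(seed)
    n = len(majority)

    def score(mask):
        return fitness(mask, majority, minority, validation)

    # Start from the unresampled dataset plus random masks.
    population = [[True] * n]
    population += [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Elitism: the better half survives unchanged, so the best score never drops.
        elite = sorted(population, key=score, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = elite + children
    best = max(population, key=score)
    return best, score(best)


def make_toy_data(seed=1, n_major=60, n_minor=8):
    """Synthetic 2-D data: frequent class near (0, 0), rare class near (1.5, 1.5)."""
    rng = random.Random(seed)

    def cloud(center, count):
        return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0)) for _ in range(count)]

    majority, minority = cloud(0.0, n_major), cloud(1.5, n_minor)
    validation = [(p, 0) for p in cloud(0.0, 30)] + [(p, 1) for p in cloud(1.5, 10)]
    return majority, minority, validation
```

In this sketch the resampling rate is not fixed in advance: it emerges as the fraction of `True` entries in the returned mask, `sum(mask) / len(mask)`. Because the unresampled dataset is placed in the initial population and the elite is carried over unchanged, the returned score cannot be lower than the score obtained without resampling on the same validation set.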

Keywords

Imbalanced datasets; Classification; Data resampling; Genetic algorithm



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
