(2) Sarifuddin Madenda (Universitas Gunadarma, Indonesia)
(3) Setia Wirawan (Universitas Gunadarma, Indonesia)
(4) Ruddy J. Suhatril (Universitas Gunadarma, Indonesia)
*corresponding author
Abstract
Scientific articles now have multidisciplinary content. This makes it difficult for researchers to find relevant information, and some submissions are irrelevant to the journal's discipline. Categorizing articles and assessing their relevance can aid both researchers and journals. Existing research still focuses on single-category predictive outcomes. Therefore, this research takes a new approach by applying multidisciplinary classification to Indonesian scientific article abstracts using a pre-trained BERT model, showing the relevance of each category within an abstract. The dataset consists of 9,000 abstracts spanning nine disciplinary categories, on which text preprocessing was performed. The classification model was built by combining the pre-trained BERT model with an artificial neural network. Hyperparameter tuning was performed to determine the optimal combination for the model; the hyperparameters consist of batch size, learning rate, number of epochs, and data ratio. The best combination is a learning rate of 1e-5, a batch size of 32, 3 epochs, and a 9:1 data ratio, with a validation accuracy of 90.8%. The model's confusion matrix results were compared with confusion matrix results produced by experts; in this comparison, the highest accuracy obtained by the model is 99.56%. A software prototype used the most accurate model to classify new data, displaying the top two prediction probabilities and the dominant category. This research produces a model that can be used to solve Indonesian text classification problems.
Keywords: Abstract; BERT; Classification; Hyperparameter-Tuning; Multidisciplinary
DOI: https://doi.org/10.26555/ijain.v9i2.1051
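To make the pipeline described in the abstract concrete, the following is a minimal sketch of the fine-tuning setup using the Hugging Face transformers library. Only the hyperparameters (learning rate 1e-5, batch size 32, 3 epochs, 9:1 train/validation split), the nine-category output, and the top-two probability display come from the abstract; the checkpoint name (cahya/bert-base-indonesian-522M), the linear head added by AutoModelForSequenceClassification (the paper's ANN head may differ), and the data-handling helpers are illustrative assumptions.

# Sketch: fine-tuning an Indonesian pre-trained BERT for 9-category abstract
# classification with the hyperparameters reported in the abstract.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cahya/bert-base-indonesian-522M"  # assumed Indonesian BERT checkpoint
NUM_CATEGORIES = 9                              # nine disciplinary categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CATEGORIES
)

def build_loaders(abstracts, labels, batch_size=32, ratio=0.9):
    # Tokenize the abstracts and split them 9:1 into train/validation loaders.
    enc = tokenizer(abstracts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels))
    n_train = int(ratio * len(dataset))
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(val_set, batch_size=batch_size))

def fine_tune(train_loader, epochs=3, lr=1e-5):
    # Standard fine-tuning loop with the best hyperparameter combination.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in train_loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optimizer.step()

def predict_top2(text):
    # Return the two most probable categories for a new abstract, mirroring
    # the prototype's display of the top two prediction probabilities.
    model.eval()
    with torch.no_grad():
        enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        probs = torch.softmax(model(**enc).logits, dim=-1)[0]
    top = torch.topk(probs, k=2)
    return list(zip(top.indices.tolist(), top.values.tolist()))

Given lists of abstract strings and integer category labels, build_loaders and fine_tune reproduce the reported training configuration, and predict_top2 returns the dominant and runner-up categories for new data.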
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)