Cross-domain sentiment analysis model on Indonesian YouTube comment

(1) * Agus Sasmito Aribowo (Universitas Pembangunan Nasional "Veteran" Yogyakarta, Indonesia)
(2) Halizah Basiron (Universiti Teknikal Malaysia Melaka, Malaysia)
(3) Noor Fazilla Abd Yusof (Universiti Teknikal Malaysia Melaka, Malaysia)
(4) Siti Khomsah (Institut Teknologi Telkom Purwokerto, Indonesia)
*corresponding author


A study of cross-domain sentiment analysis (CDSA) in the Indonesian language using tree-based ensemble machine learning is of considerable interest. CDSA supports the labeling of sentiment across domains and reduces dependence on experts; however, how it behaves on opinions made unstructured by stop words, informal expressions, and Indonesian slang words has not yet been established. This study aimed to obtain the best CDSA model for opinions in the Indonesian language, which are commonly full of stop words and slang words in Indonesian dialects. It examined the benefits of stop word removal and slang word conversion for CDSA in Indonesian, and sought to determine which machine learning method suits this model. The study began by crawling five datasets of YouTube comments from five different domains. The datasets were copied into two groups: one without stop word removal and slang word conversion, and one with both steps applied. A CDSA model was built for each dataset group and tested using two tree-based ensemble classifiers, Random Forest (RF) and Extra Tree (ET), and, for comparison, three non-ensemble classifiers: Naïve Bayes (NB), SVM, and Decision Tree (DT). The results suggest that the accuracy of CDSA in the Indonesian language increases when stop words are removed and slang words are converted. The best classifier was built using tree-based ensemble machine learning, particularly ET, which achieved the highest accuracy in this study, 91.19%. This model is expected to serve as an alternative CDSA technique for the Indonesian language.
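The two dataset groups described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop word set, the slang dictionary, the function names (`tokenize`, `preprocess`, `build_groups`), and the choice to convert slang before removing stop words are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical, tiny lexicons for illustration only; the study's actual
# stop word list and slang dictionary are not given in the abstract.
STOP_WORDS = {"yang", "dan", "di", "ini", "itu"}
SLANG_MAP = {"gak": "tidak", "bgt": "banget", "yg": "yang", "bgs": "bagus"}


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", comment.lower())


def preprocess(comment):
    """Convert slang words to standard forms, then drop stop words.

    Converting first (an assumed ordering) lets slang spellings of
    stop words, such as "yg" for "yang", be removed as well.
    """
    tokens = [SLANG_MAP.get(tok, tok) for tok in tokenize(comment)]
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


def build_groups(comments):
    """Return (untouched group, cleaned group) for a list of comments."""
    raw = list(comments)                        # group 1: no cleaning
    cleaned = [preprocess(c) for c in comments]  # group 2: cleaned
    return raw, cleaned


if __name__ == "__main__":
    raw, cleaned = build_groups(["Videonya bgs bgt, yg ini gak jelek!"])
    print(raw[0])
    print(cleaned[0])
```

Each group would then be vectorized and passed to the classifiers under comparison, so that any accuracy difference can be attributed to the cleaning steps rather than to the data.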


Cross-domain; Sentiment analysis; Tree-based ensemble ML; Remove stop word; Convert slang word







This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan