Semi-supervised learning for sentiment classification with ensemble multi-classifier approach

(1) * Agus Sasmito Aribowo (Universitas Pembangunan Nasional "Veteran" Yogyakarta Indonesia & FTMK UTeM Melaka Malaysia, Indonesia)
(2) Halizah Basiron (Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka, Malaysia)
(3) Noor Fazilla Abd Yusof (Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka, Malaysia)
*corresponding author


Supervised sentiment analysis ideally uses a fully labeled dataset for modeling. However, meeting this ideal requires a laborious label annotation process. Semi-supervised learning (SSL) has emerged as a promising method to avoid time-consuming and expensive data labeling without reducing model performance. However, research on SSL is still limited and its performance needs to be improved. This study therefore introduces a new SSL model for sentiment classification based on an ensemble of classifiers. The pipeline consists of pre-processing, vectorization, and feature extraction using TF-IDF and n-gram tokenization. Separate models were generated for unigram, bigram, and trigram features using either Support Vector Machine (SVM) or Random Forest (RF). The outputs of these models were then combined using a stacking ensemble approach. Accuracy and F1-score were used for evaluation, and the new SSL models were tested on the IMDB and US Airlines datasets. The conclusion is that sentiment annotation accuracy depends strongly on how well the dataset suits the machine learning algorithm. For the IMDB dataset, which consists of two classes, SVM is the better choice. For the US Airlines dataset, which consists of three classes, SVM is better at improving model performance over the baseline, whereas RF is better at reaching the baseline performance, although it fails to maintain that performance.
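
The pipeline described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation. The abstract does not say how unlabeled data are annotated, so the confidence-thresholded self-training loop here is an assumption, as are the names `build_stacked_model` and `self_train`, the logistic-regression meta-learner, the `threshold` and `max_rounds` parameters, and the toy sentences. One TF-IDF model is trained per n-gram order and the three are combined by stacking.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_stacked_model(cv=5):
    """One TF-IDF + SVM pipeline per n-gram order, combined by stacking."""
    base_models = [
        # ngram_range=(n, n) restricts each base model to a single n-gram order.
        (f"{n}gram", make_pipeline(TfidfVectorizer(ngram_range=(n, n)), LinearSVC()))
        for n in (1, 2, 3)
    ]
    # Assumption: logistic regression as meta-learner (not stated in the abstract).
    return StackingClassifier(
        estimators=base_models, final_estimator=LogisticRegression(), cv=cv
    )


def self_train(labeled_texts, labels, unlabeled_texts,
               threshold=0.9, max_rounds=5, cv=5):
    """Assumed SSL loop: pseudo-label confident predictions, then retrain."""
    texts, y = list(labeled_texts), list(labels)
    pool = list(unlabeled_texts)
    model = build_stacked_model(cv).fit(texts, y)
    for _ in range(max_rounds):
        if not pool:
            break
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is confident enough to annotate automatically
        predicted = model.classes_[proba.argmax(axis=1)]
        remaining = []
        for text, label, keep in zip(pool, predicted, confident):
            if keep:
                texts.append(text)
                y.append(label)
            else:
                remaining.append(text)
        pool = remaining
        model = build_stacked_model(cv).fit(texts, y)
    return model, texts, y


# Toy data, for illustration only.
labeled = [
    "the movie was really great fun",
    "what a great and lovely film",
    "really great acting and lovely story",
    "the film was lovely and fun",
    "the movie was really bad and dull",
    "what a bad and boring film",
    "really bad acting and boring story",
    "the film was boring and dull",
]
labels = ["pos"] * 4 + ["neg"] * 4
unlabeled = [
    "the story was really great fun",
    "the acting was really bad and dull",
    "what a lovely movie it was",
]

model, texts, y = self_train(labeled, labels, unlabeled, threshold=0.5, cv=2)
predictions = model.predict(unlabeled)
```

To sketch the Random Forest variant, `LinearSVC()` could be replaced with `RandomForestClassifier()` from `sklearn.ensemble`. The threshold governs the trade-off between how much unlabeled data gets annotated and how much label noise enters the training set.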


Ensemble Multi-classifier; Semi-supervised; Sentiment Analysis; SVM; Random Forest



