Comparison of Semi-Supervised Learning Performance in Indonesian Sentiment Analysis: An Empirical Study between Statistical Machine Learning and Deep Learning Approaches

Rochmat Husaini; Nur Heri Cahyana; Ida Wiendijarti; Agus Sasmito Aribowo

doi:10.31098/cset.v4i1.957

Authors

Rochmat Husaini Universitas Pembangunan Nasional Veteran Yogyakarta
Nur Heri Cahyana Universitas Pembangunan Nasional Veteran Yogyakarta
Ida Wiendijarti Universitas Pembangunan Nasional Veteran Yogyakarta
Agus Sasmito Aribowo Universitas Pembangunan Nasional Veteran Yogyakarta

DOI:

https://doi.org/10.31098/cset.v4i1.957

Keywords:

semi-supervised learning, sentiment analysis, statistical machine learning, Bi-LSTM, pseudo-labeling

Abstract

The limited availability of labeled data is a significant challenge in developing sentiment analysis models, especially for Indonesian, which still has few annotated resources. Semi-supervised learning (SSL) offers a solution by utilizing large amounts of unlabeled data. This study compares the performance of two main paradigms in SSL, Statistical Machine Learning (SML) and Deep Learning (DL), for Indonesian text sentiment classification. Four SML models (KNN, Naïve Bayes, Random Forest, SVM) with TF-IDF, Word2Vec, and FastText feature representations were compared with a fine-tuned Bi-LSTM architecture built on FastText embeddings. Experiments were conducted on two datasets, product reviews (14,000 instances) and social media (22,000 instances), each starting with only 10% of the data labeled. Self-training was applied with a confidence threshold of 0.8 and a maximum of 3 iterations. The results show that DL consistently outperforms SML in accuracy (89.7% vs. 84.2% on large datasets), F1-score (89.4% vs. 83.6%), and efficiency in utilizing unlabeled data (95.6% vs. 90.2% of pseudo-labels accepted). However, this advantage comes at the price of 4x higher computational cost and lower interpretability. SML remains relevant when resources are limited or when model transparency is a priority. This study recommends DL when the infrastructure is adequate, and SML when interpretability and computational efficiency are prioritized. These findings provide empirical guidance for practitioners and academics in choosing the optimal SSL approach for Indonesian sentiment analysis.
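The self-training procedure described in the abstract can be illustrated with a minimal sketch. It uses the stated confidence threshold of 0.8 and maximum of 3 iterations, but everything else is an assumption made for illustration: the small bag-of-words Naive Bayes classifier, the function names (`train_nb`, `predict_proba`, `self_train`), and the toy Indonesian sentences are hypothetical and are not the authors' models, features, or data.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return {label: posterior probability} using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_iter=3):
    """Self-training: repeatedly pseudo-label confident unlabeled texts.

    threshold and max_iter default to the values stated in the abstract.
    Returns (final model, number of accepted pseudo-labels, leftover texts).
    """
    texts = [t for t, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    accepted_total = 0
    for _ in range(max_iter):
        model = train_nb(texts, labels)
        remaining, accepted = [], 0
        for text in pool:
            probs = predict_proba(model, text)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:  # confident: add as a pseudo-labeled example
                texts.append(text)
                labels.append(label)
                accepted += 1
            else:
                remaining.append(text)
        pool = remaining
        accepted_total += accepted
        if accepted == 0 or not pool:  # nothing new learned, or pool exhausted
            break
    return train_nb(texts, labels), accepted_total, pool


# Hypothetical toy data (not from the study's datasets).
labeled = [
    ("produk bagus kualitas bagus", "pos"),
    ("produk buruk kualitas buruk", "neg"),
]
unlabeled = ["bagus bagus", "buruk buruk sekali", "produk kualitas"]

model, accepted_total, leftover = self_train(labeled, unlabeled)
print(accepted_total, leftover)
```

In this sketch, texts whose top predicted class falls below the threshold stay in the unlabeled pool and are reconsidered after retraining; the ratio of accepted pseudo-labels to the original pool size corresponds to the kind of "accepted pseudo-labels" percentage reported in the abstract.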
