Hate Speech Detection of Manglish (Malay + English) in X (Twitter) Using XLM-RoBERTa and XLNet
DOI: https://doi.org/10.11113/oiji2025.13n2.347

Keywords: Hate speech detection, Manglish, XLM-RoBERTa, XLNet, deep learning, social media, natural language processing

Abstract
This study explores hate speech detection in Manglish, a code-mixed language of Malay and English widely used by Malaysian social media users. The main objective is to develop and evaluate deep learning-based models capable of identifying hate speech in Manglish tweets. A dataset of 9,241 manually annotated tweets was collected from X (formerly Twitter) and processed using the Malaya NLP library for code-mixing detection. Two state-of-the-art transformer-based models, XLM-RoBERTa and XLNet, were fine-tuned under both imbalanced and upsampled training conditions. Evaluation metrics, including precision, recall, F1-score, accuracy, and evaluation loss, were used to assess model performance. Results indicate that XLNet achieved the highest F1-score and fastest inference time under imbalanced conditions, while XLM-RoBERTa demonstrated stronger generalization with lower evaluation loss. After upsampling, both models improved significantly, achieving balanced performance across both classes. This research contributes a novel annotated Manglish dataset and highlights the importance of context-aware multilingual models for hate speech detection in code-mixed social media posts.
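The upsampling step mentioned in the abstract, i.e. balancing a hate/non-hate label distribution before fine-tuning, can be sketched as below. This is an illustrative example assuming a simple list of (text, label) pairs, not the authors' exact preprocessing code; the function name `upsample_minority` is hypothetical.

```python
import random

def upsample_minority(samples, seed=42):
    """Randomly duplicate minority-class examples until both classes
    are equal in size.

    `samples` is a list of (text, label) pairs with labels 0 (non-hate)
    and 1 (hate). Returns a shuffled, class-balanced copy.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for text, label in samples:
        by_label[label].append((text, label))
    # Identify which class has fewer examples and how many more it needs.
    minority = min(by_label, key=lambda k: len(by_label[k]))
    majority = max(by_label, key=lambda k: len(by_label[k]))
    deficit = len(by_label[majority]) - len(by_label[minority])
    # Sample extra minority examples with replacement, then shuffle.
    upsampled = samples + [rng.choice(by_label[minority]) for _ in range(deficit)]
    rng.shuffle(upsampled)
    return upsampled
```

The balanced list can then be tokenized and fed to either model's fine-tuning loop; duplicating minority examples (rather than downsampling the majority) keeps the full dataset available to the model, at the cost of repeated minority tweets per epoch.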