Hate Speech Detection of Manglish (Malay + English) in X (Twitter) Using XLM-RoBERTa and XLNet

Authors

  • Farisha Binti Azmi, UTM
  • Normaisharah Mamat
  • Nur Azaliah Abu Bakar
  • Siti Maherah Hussin

DOI:

https://doi.org/10.11113/oiji2025.13n2.347

Keywords:

Hate speech detection, Manglish, XLM-RoBERTa, XLNet, deep learning, social media, natural language processing

Abstract

This study explores hate speech detection in Manglish, a code-mixed language of Malay and English widely used among Malaysian social media users. The main objective is to develop and evaluate deep learning-based models capable of identifying hate speech in Manglish tweets. A dataset of 9,241 manually annotated tweets was collected from X (formerly Twitter) and processed using the Malaya NLP library for code-mixing detection. Two state-of-the-art transformer-based models, XLM-RoBERTa and XLNet, were fine-tuned under both imbalanced and upsampled training conditions. Model performance was assessed using precision, recall, F1-score, accuracy, and evaluation loss. Results indicate that XLNet achieved the highest F1-score and fastest inference time under imbalanced conditions, while XLM-RoBERTa demonstrated stronger generalization with lower evaluation loss. After upsampling, both models improved significantly and achieved balanced performance across the two classes. This research contributes a novel annotated Manglish dataset and highlights the importance of context-aware multilingual models for hate speech detection in code-mixed social media posts.
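
As a rough illustration of the upsampled training condition and the per-class metrics mentioned in the abstract, the following minimal Python sketch duplicates minority-class examples at random until the classes are balanced, then computes precision, recall, and F1-score for one class. The abstract does not state which upsampling method the authors used, so random duplication is an assumption here. The function names (upsample_minority, precision_recall_f1) and the toy data are hypothetical and are not taken from the paper or its dataset.

```python
import random
from collections import Counter


def upsample_minority(texts, labels, seed=0):
    """Randomly duplicate examples of smaller classes until every class
    has as many examples as the largest one (assumed method, not the paper's)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    pairs = []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy data: label 1 = hate speech, label 0 = not hate speech.
toy_texts = ["t1", "t2", "t3", "t4", "t5"]
toy_labels = [0, 0, 0, 0, 1]
up_texts, up_labels = upsample_minority(toy_texts, toy_labels)
class_counts = Counter(up_labels)

# Hypothetical predictions, only to exercise the metric function.
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

In such a setup, upsampling would normally be applied to the training split only, so that the evaluation metrics still reflect the original class distribution.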

Published

2025-12-26

How to Cite

Binti Azmi, F., Mamat, N., Abu Bakar, N. A., & Hussin, S. M. (2025). Hate Speech Detection of Manglish (Malay + English) in X (Twitter) Using XLM-RoBERTa and XLNet. Open International Journal of Informatics, 13(2), 97–104. https://doi.org/10.11113/oiji2025.13n2.347