Hate Speech Detection of Manglish (Malay + English) in X (Twitter) Using XLM-RoBERTa and XLNet

Farisha Binti Azmi; Normaisharah Mamat; Nur Azaliah Abu Bakar; Siti Maherah Hussin

doi:10.11113/oiji2025.13n2.347

Authors

Farisha Binti Azmi UTM
Normaisharah Mamat
Nur Azaliah Abu Bakar
Siti Maherah Hussin

DOI:

https://doi.org/10.11113/oiji2025.13n2.347

Keywords:

Hate speech detection, Manglish, XLM-RoBERTa, XLNet, deep learning, social media, natural language processing

Abstract

This study explores hate speech detection in Manglish, a code-mixed variety of Malay and English widely used by Malaysian social media users. The main objective is to develop and evaluate deep learning models capable of identifying hate speech in Manglish tweets. A dataset of 9,241 manually annotated tweets was collected from X (formerly Twitter) and processed using the Malaya NLP library for code-mixing detection. Two transformer-based models, XLM-RoBERTa and XLNet, were fine-tuned under both imbalanced and upsampled training conditions. Model performance was assessed using precision, recall, F1-score, accuracy, and evaluation loss. Under imbalanced conditions, XLNet achieved the highest F1-score and the fastest inference time, while XLM-RoBERTa demonstrated stronger generalization with lower evaluation loss. After upsampling, both models improved significantly, achieving balanced performance across both classes. This research contributes a novel annotated Manglish dataset and highlights the importance of context-aware multilingual models for hate speech detection in code-mixed social media posts.
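The upsampling and per-class metrics mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how upsampling was performed, so random duplication of minority-class training examples is an assumption here. The function names `upsample` and `precision_recall_f1` and the placeholder rows are hypothetical and are not taken from the study's dataset or code.

```python
import random
from collections import Counter


def upsample(examples, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest.

    Assumption: random duplication; the study's exact upsampling method is not stated.
    `examples` is a list of (text, label) pairs from the training split only.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    rng.shuffle(balanced)
    return balanced


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder rows (hypothetical): label 1 = hate speech, label 0 = not hate speech.
train = [(f"placeholder text {i}", 0) for i in range(6)]
train += [(f"placeholder text {i}", 1) for i in range(6, 8)]

balanced = upsample(train)
label_counts = Counter(label for _, label in balanced)

# Toy predictions to exercise the metric function.
metrics = precision_recall_f1([1, 1, 0, 0], [1, 0, 0, 1])
```

Upsampling is applied only to the training split in this sketch, so that evaluation scores are computed on the original class distribution.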
