| dc.description.abstract |
Banglish, the informal hybrid of Bengali and English typed in the Latin alphabet, presents
distinctive challenges: code-switching, nonstandard spelling, and transliteration
drift. This study builds a full pipeline for Banglish, from data acquisition
and annotation to modeling and analysis, across seven categories
(Appearance, Not Hate, Others, Racial, Religious, Sexual, and Slang). In the study, we
create and clean a social media corpus, design a preprocessing suite (custom stop-word
filtering, regex tokenization, and rule-based normalization of spelling variants)
tailored to Banglish, and address class imbalance via staged over- and under-
sampling to a balanced set of 2,000 instances per class. To compare model
performance, we test recurrent architectures (LSTM, GRU, BiLSTM, BiGRU) and
their hybrids (LSTM+GRU, BiLSTM+BiGRU) against transformer models (mBERT,
XLM-RoBERTa) under equal training conditions. mBERT performs best
(accuracy 0.88, macro-F1 0.87), followed by BiLSTM+BiGRU as the strongest recurrent
model (accuracy 0.84, macro-F1 0.84), whereas XLM-RoBERTa performs worst (accuracy
0.75, macro-F1 0.74); the leading transformer thus outperforms every recurrent
model on this task, though not all transformers do. A confusion-matrix analysis reveals that RNNs consistently fail
by collapsing ambiguous classes (Not Hate, Others, Sexual) into Appearance. This
failure is substantially reduced by mBERT. We conclude that, with Banglish-specific
preprocessing and balanced evaluation, multilingual transformers provide the most
reliable basis for moderating Banglish content. |
en_US |