| dc.description.abstract |
Bangla is one of the most widely spoken languages in the world, yet it remains a
low-resource language in Natural Language Processing (NLP). The challenge is
compounded by several regional dialects, such as those of Sylhet, Chittagong,
Barishal, Noakhali, and Khulna, which differ markedly from Standard Bangla. This
thesis addresses dialect classification and the conversion of dialectal text to
Standard Bangla using a custom multi-regional corpus, with both traditional
machine learning models and transformer-based models. A corpus was constructed
comprising 23,440 dialect-standard sentence pairs from five major dialects. After
cleaning, normalization, and dataset splitting, the corpus was used to train
traditional machine learning models (SVM, NB, LR, and RF) and transformer
architectures (BanglaBERT, mBERT, MuRIL, and XLM-R) for classification, and an
LSTM baseline, BanglaT5, mBART-50, and mT5 for
standardization. Evaluation used a range of metrics: accuracy, precision,
recall, and F1-score for classification, and BLEU, ROUGE-L, METEOR, chrF, TER, and
exact match for standardization. Among the traditional models, SVM achieved the
best accuracy at 81.1%, while MuRIL and XLM-R reached up to 92.4% with macro-F1 above 0.92. For
standardization, mBART-50 achieved BLEU = 0.78, ROUGE-L = 0.89,
METEOR = 0.87, and Exact Match = 65.6%. A user-friendly Gradio interface has
also been created to make the system accessible to all users. This study contributes a new
dialectal corpus, a broad comparison of traditional and transformer models, and a
usable NLP tool. The results show that transformer-based
models are well suited to the dialectal diversity of Bangla and can help pave the way
for standardized digital communication in Bangla. |
en_US |