DSpace Repository

Bangla Dialect Classification and Standardization Using Traditional and Transformer-Based Approaches on a Custom Multi-Regional Corpus

dc.contributor.author Talukder, Md Shamim
dc.contributor.author Kholil, Md Ibrahim
dc.date.accessioned 2026-04-12T09:09:01Z
dc.date.available 2026-04-12T09:09:01Z
dc.date.issued 2025-09-16
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16706
dc.description Project Report en_US
dc.description.abstract Bangla is one of the most widely spoken languages in the world, yet it remains a low-resource language in Natural Language Processing (NLP). The challenge is compounded by several regional dialects, such as Sylhet, Chittagong, Barishal, Noakhali and Khulna, which differ considerably from Standard Bangla. This thesis addresses dialect classification and the conversion of dialectal text to Standard Bangla using a custom multi-regional corpus, with both traditional machine learning models and transformer-based models. The corpus comprises 23,440 dialect and standard sentence pairs from 5 major dialects. After cleaning, normalization, and dataset splitting, the corpus was used to train traditional machine learning models (SVM, NB, LR, RF) and advanced transformer architectures (BanglaBERT, mBERT, MuRIL and XLM-R) for classification, and an LSTM baseline, BanglaT5, mBART-50 and mT5 for standardization. Evaluation used a wide range of metrics: accuracy, precision, recall and F1-score for classification, and BLEU, ROUGE-L, METEOR, chrF, TER and Exact Match for standardization. While SVM reached the best accuracy among the traditional models at 81.1%, MuRIL and XLM-R achieved up to 92.4% with a macro-F1 above 0.92. For standardization, mBART-50 achieved BLEU = 0.78, ROUGE-L = 0.89, METEOR = 0.87, and Exact Match = 65.6%. A user-friendly Gradio interface was also built to make the system accessible to any user. This study contributes a new dialectal corpus, a broad comparison of traditional and transformer models, and a working NLP tool. The results indicate that advanced transformer-based models are well suited to the dialect diversity of Bangla and can help pave the way for standardized digital communication in Bangla. en_US
dc.description.sponsorship Daffodil International University en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject NLP en_US
dc.subject Bangla-Dialect en_US
dc.subject Classification en_US
dc.subject Standardization en_US
dc.subject Transformer Models (mBERT) en_US
dc.title Bangla Dialect Classification and Standardization Using Traditional and Transformer-Based Approaches on a Custom Multi-Regional Corpus en_US
dc.type Other en_US
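
The abstract reports macro-F1 for classification and Exact Match for standardization. As a rough illustration of how those two scores are usually defined, here is a minimal sketch in plain Python. It is not taken from the report: the function names `macro_f1` and `exact_match` and the toy labels and sentences are assumptions made for this example, and the report's own evaluation code may differ (for instance in how whitespace is normalized before matching).

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # true label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def exact_match(references, hypotheses):
    """Fraction of outputs identical to their reference after trimming whitespace."""
    matches = sum(r.strip() == h.strip() for r, h in zip(references, hypotheses))
    return matches / len(references)


# Toy data, invented for illustration only (not from the corpus).
gold = ["sylhet", "sylhet", "khulna", "khulna"]
pred = ["sylhet", "khulna", "khulna", "khulna"]
print(round(macro_f1(gold, pred), 4))

refs = ["sentence one", "sentence two"]
hyps = ["sentence one", "sentence 2"]
print(exact_match(refs, hyps))
```

Macro averaging weights each dialect equally regardless of how many sentences it has, which is why it is often reported alongside accuracy when class sizes differ.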