Bangla Dialect Classification and Standardization Using Traditional and Transformer-Based Approaches on a Custom Multi-Regional Corpus

Talukder, Md Shamim; Kholil, Md Ibrahim

dc.contributor.author	Talukder, Md Shamim
dc.contributor.author	Kholil, Md Ibrahim
dc.date.accessioned	2026-04-12T09:09:01Z
dc.date.available	2026-04-12T09:09:01Z
dc.date.issued	2025-09-16
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16706
dc.description	Project Report	en_US
dc.description.abstract	Bangla is one of the most widely spoken languages in the world, yet it remains a low-resource language in Natural Language Processing (NLP). The challenge is compounded by several regional dialects, such as those of Sylhet, Chittagong, Barishal, Noakhali, and Khulna, which differ considerably from Standard Bangla. This thesis addresses dialect classification and the conversion of dialect text to Standard Bangla using a custom multi-regional corpus, with both traditional machine learning models and transformer-based models. The corpus comprises 23,440 dialect and standard sentence pairs from five major dialects. After cleaning, normalization, and dataset splitting, the corpus was used to train traditional machine learning models (SVM, NB, LR, RF) and transformer architectures (BanglaBERT, mBERT, MuRIL, XLM-R) for classification, and an LSTM baseline, BanglaT5, mBART-50, and mT5 for standardization. Evaluation used Accuracy, Precision, Recall, and F1-score for classification, and BLEU, ROUGE-L, METEOR, chrF, TER, and Exact Match for standardization. Among the traditional models, SVM showed the best accuracy at 81.1%, while MuRIL and XLM-R achieved up to 92.4% with a macro-F1 above 0.92. For standardization, mBART-50 achieved BLEU = 0.78, ROUGE-L = 0.89, METEOR = 0.87, and Exact Match = 65.6%. A user-friendly Gradio interface was also built to make the system accessible to users. This study contributes a new dialectal corpus, a broad comparison of traditional and transformer models, and a usable NLP tool. The results indicate that transformer-based models are well suited to the dialect diversity of Bangla and can help pave the way for standardized digital communication in Bangla.	en_US
dc.description.sponsorship	Daffodil International University	en_US
dc.language.iso	en_US	en_US
dc.publisher	Daffodil International University	en_US
dc.subject	NLP	en_US
dc.subject	Bangla-Dialect	en_US
dc.subject	Classification	en_US
dc.subject	Standardization	en_US
dc.subject	Transformer Models (mBERT)	en_US
dc.title	Bangla Dialect Classification and Standardization Using Traditional and Transformer-Based Approaches on a Custom Multi-Regional Corpus	en_US
dc.type	Other	en_US
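
The abstract reports Accuracy, macro-averaged Precision, Recall, and F1 for dialect classification, and Exact Match for standardization. The sketch below illustrates how such scores are conventionally computed, using only the Python standard library. It is not the authors' evaluation code: the function names `classification_metrics` and `exact_match`, the whitespace stripping before comparison, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def classification_metrics(gold, pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Hypothetical helper: macro averaging weights every dialect label
    equally, regardless of how many sentences it has.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")

    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


def exact_match(references, hypotheses):
    """Fraction of outputs identical to the reference after stripping
    leading and trailing whitespace (an assumed normalization)."""
    if len(references) != len(hypotheses) or not references:
        raise ValueError("inputs must be non-empty and equal in length")
    hits = sum(r.strip() == h.strip() for r, h in zip(references, hypotheses))
    return hits / len(references)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the corpus.
    gold = ["sylhet", "sylhet", "chittagong", "barishal"]
    pred = ["sylhet", "chittagong", "chittagong", "barishal"]
    print(classification_metrics(gold, pred))
    print(exact_match(["a b", "c"], ["a b ", "d"]))
```

Macro averaging is shown because the abstract reports macro-F1; with an imbalanced dialect distribution it can differ noticeably from accuracy.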