dc.description.abstract |
The goal of this project is to reliably detect linguistic variants and dialects by classifying
regional languages spoken in Bangladesh using machine learning (ML) and deep learning
(DL) techniques. The dataset has about 3000 entries covering five major regional
languages (Chattogram: 655, Dhaka: 608, Rangpur: 621, Sylhet: 553, Noakhali: 562). The
data collection procedure included developing a survey form, obtaining and preparing
text samples, and cleaning the data with natural language processing methods. The ML
and DL models assessed were Bernoulli Naive Bayes (BNB), Support Vector Machines (SVM),
Random Forest, Logistic Regression (LR), Bi-directional Long Short-Term Memory
(Bi-LSTM), and Convolutional Neural Networks (CNN). According to
the results, DL models (CNN: 98.48%, Bi-LSTM: 95.24%) substantially outperform classic
ML methods at classifying regional languages (Random Forest: 70.00%, SVM: 67.78%,
LR: 66.22%, BNB: 64.44%). Overall, the study shows how well DL methods
capture the complex linguistic patterns on which regional language classification
depends. It also underlines the cultural importance of Bangladesh's
language diversity and promotes ethical research methods that
help preserve languages and foster social inclusion. Future work
includes extending the models with syntactic and
semantic analysis and examining the wider sociocultural implications of
language classification technology. |
en_US |