Abstract:
This project addresses the problem of distinguishing between two forms of the Bangla language,
namely Sadhubhasha and Cholitobhasha. Such a classifier would be beneficial for finding the
right word choice in Bangla literature. The main aim of this project is to differentiate
Sadhubhasha, the early form of Bangla in the modern era, from Cholitobhasha, the current form. As
far as we know, no prior work has addressed this particular issue. More
broadly, only a few works have been done on the Bangla language, so it has
been difficult to conduct advanced linguistic work on Bangla, such as information
extraction or summarization. We faced difficulties in collecting Bangla data due
to its limited availability, but we ultimately collected a dataset of around 100,000 words
for this project, of which 80% is used for training and the remaining 20% for testing.
The machine learning algorithms Random Forest, Naïve Bayes, Support Vector Machine, K-nearest neighbor, and Decision Tree are applied to classify the language, with Term
Frequency-Inverse Document Frequency and Bag of Words used for the numerical
representation. With these classifiers, accuracies of 91% to 99.5% are observed. Given this promising
outcome, the "Sadhu and Cholito language classifier" can serve as a first
step on the ladder from which others will be encouraged to do further research on the Bangla
language.
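
The numerical representation and the 80/20 split described above can be sketched as follows. This is a minimal pure-Python illustration, not the project's implementation: the toy sentences, the function names, the whitespace tokenizer, and the unsmoothed log(N/df) variant of IDF are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy labelled sentences (illustrative only, NOT taken from the project's dataset).
DOCS = [
    ("তাহারা যাইতেছে", "sadhu"),
    ("আমি ভাত খাইতেছি", "sadhu"),
    ("তারা যাচ্ছে", "cholito"),
    ("আমি ভাত খাচ্ছি", "cholito"),
    ("তাহারা ভাত খাইতেছে", "sadhu"),
]

def bag_of_words(text):
    """Bag of Words: map each whitespace-separated token to its count."""
    return Counter(text.split())

def idf_table(bags):
    """Inverse document frequency, log(N / df), for every term seen in `bags`."""
    n_docs = len(bags)
    # Iterating a Counter yields each distinct term once, so this counts documents.
    df = Counter(term for bag in bags for term in bag)
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf(bag, idf):
    """Relative term frequency weighted by IDF; terms missing from `idf` are skipped."""
    total = sum(bag.values())
    return {t: (c / total) * idf[t] for t, c in bag.items() if t in idf}

def split_80_20(items):
    """First 80% of the items for training, the rest for testing (no shuffling)."""
    cut = len(items) * 8 // 10
    return items[:cut], items[cut:]

train, test = split_80_20(DOCS)
train_bags = [bag_of_words(text) for text, _ in train]
idf = idf_table(train_bags)          # IDF is estimated on the training portion only
train_vectors = [tfidf(bag, idf) for bag in train_bags]
```

Vectors of this kind would then be passed to a classifier such as those listed above. In practice the data would normally be shuffled before splitting, which this sketch omits for brevity.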