Abstract:
In this paper, we introduce a large-scale, structured dataset of Bangla news articles with
320k instances in eight predefined classes (Science & Technology, International,
National, Sports, Entertainment, Economy, Politics, and Education) that aims to advance
Bengali Natural Language Processing (NLP). The objective is to address the text
classification problem for Bangla content. We trained and evaluated a range of
deep-learning models on the articles. Bangla-BERT, a transformer-based model, attained
the highest accuracy (92%), outperforming the other architectures tested (GRU, LSTM,
CNN, and a hybrid model). The dataset and the resulting insights on model
performance add significantly to the resources available for Bangla NLP and provide a
benchmark for future work in this area. The implications of this work reach both
academia and industry: Bangladeshi national newspaper organizations can use these
models for efficient article categorization, and NLP researchers can use the dataset,
together with the insights on model effectiveness, for Bangla text classification.
This work represents a small step towards bridging the gap in NLP resources for the
Bengali language and may pave the way for measurable progress in automated language
processing for Bangla.