| dc.description.abstract |
This project develops an abstractive summarization system for Bangla news articles using small-scale transformer models, specifically mT5-small (300M parameters) and BT5 Base (247M parameters), to generate long summaries (100–200 tokens) for in-depth insights and short summaries (30–50 tokens) for quick updates. Addressing the challenge of information overload in Bangla media, the system processes a curated dataset of 10,000 articles from sources such as Prothom Alo and BBC Bangla, covering diverse topics. The methodology comprises web scraping, advanced preprocessing to handle Bangla’s linguistic complexities (e.g., morphology, dialects, Unicode issues), fine-tuning on a P100 GPU, and evaluation using ROUGE, BLEU, CER/WER, and human ratings by native speakers. mT5-small achieved ROUGE-1 F1 scores of 0.410 (long) and 0.380 (short), outperforming BT5 Base (0.230 and 0.210), which suffered from overfitting. The system improves information accessibility for journalists, educators, and the general public, aligning with SDGs 4, 9, and 10. Contributions include an open-source dataset, codebase, and models, paving the way for future Bangla NLP research despite limitations in dialect coverage and computational resources.
en_US |