| dc.description.abstract |
The increasing sophistication of natural language processing has, paradoxically, widened the digital divide, leaving thousands of low-resource languages and dialects underserved by technology. This research directly addresses this digital linguistic inequality: TriVashi is the first end-to-end, multi-stage speech-to-speech translator for the under-resourced Noakhali, Sylheti, and Chittagong dialects of Bangladesh, for which no integrated solution previously existed. A principal outcome of this work is the creation and publication of a new, gender-balanced corpus of 15,006 audio samples with parallel text, providing a foundation likely to stimulate future innovation. The proposed system employs a four-stage cascaded architecture. It first identifies the dialect through a novel visual-analytic method that reframes the problem as image classification: audio is converted into Mel spectrograms and classified by a pre-trained DenseNet121-SVM model, achieving a best accuracy of 92.7%. Once the dialect is detected, the audio is routed to dialect-specific Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) models. The experimental findings confirm the transformative effectiveness of transfer learning: after fine-tuning large pre-trained models (Whisper-Small and BanglaT5) on the curated dialectal data, the ASR Word Error Rate (WER) dropped dramatically from more than 289.2% to as low as 3.0%, and NMT BLEU scores rose to 56.3. Alongside these new state-of-the-art benchmarks, this work offers a robust and replicable methodological template, confirming the "small data, big model" paradigm as a viable path toward digital equity for under-resourced languages worldwide. |
en_US |