Abstract:
This paper presents an end-to-end speech-to-text and translation system that converts
spoken audio from five ethnic languages, namely Malo, Bawm, Santali, Marma, and Garo,
into Bangla. The study evaluates how well several neural network architectures,
including Bi-LSTM, CNN Seq2Seq, GRU Seq2Seq, vanilla Seq2Seq, and Transformer models,
handle the challenging tasks of transcription and translation. To build the dataset,
20 native speakers of each language recorded 500 unique words, producing a robust
corpus for training and evaluation. Preprocessing steps such as noise reduction,
normalization, and segmentation ensured high-quality inputs, and data augmentation
further improved model stability. In evaluation, the Transformer model outperformed
the others, reaching 97% training accuracy and 96% validation accuracy. The GRU
Seq2Seq model also performed well, striking a good balance between accuracy and
speed, whereas the CNN Seq2Seq model struggled. The Whisper model excelled at
transcription, achieving high accuracy and low word error rates across all five
languages. A thorough evaluation using metrics such as training accuracy, training
loss, validation accuracy, and validation loss showed that the Transformer model best
captured long-range dependencies and context, making it the strongest choice for
translation, while Whisper's consistently strong results confirmed its reliability
for transcription. This work has meaningful social impact: it enables people to
communicate in their native languages, fosters inclusion, and helps preserve cultural
heritage. By giving minority-language users access to essential services in their own
languages, it supports their participation in society and improves their quality of
life. The study lays a solid foundation for future advances in speech-to-text and
translation tools that promote language accessibility and cultural preservation.