Abstract:
This paper presents an end-to-end speech-to-text and translation system that converts
spoken audio from five ethnic languages, namely Malo, Bawm, Santali, Marma, and Garo,
into Bangla. The study evaluates how well several neural network architectures,
including Bi-LSTM, CNN Seq2Seq, GRU Seq2Seq, vanilla Seq2Seq, and Transformer models,
handle the challenging tasks of transcription and translation. To build the dataset,
20 native speakers of each language recorded 500 unique words, producing a robust
corpus for training and evaluation. Preprocessing steps such as noise reduction,
normalization, and segmentation ensured high-quality inputs, and data augmentation
further improved model stability. In evaluation, the Transformer model outperformed
the others, reaching 97% training accuracy and 96% validation accuracy. The GRU
Seq2Seq model also performed well, striking a good balance between accuracy and
speed, whereas the CNN Seq2Seq model struggled. The Whisper model excelled at
transcription, achieving high accuracy and low word error rates across all five
languages. A thorough evaluation using metrics such as training accuracy, training
loss, validation accuracy, and validation loss showed that the Transformer model best
captured long-range dependencies and context, making it the strongest choice for
translation, while Whisper's consistently strong results confirmed its reliability
for transcription. This work has meaningful social impact: it enables people to
communicate in their native languages, fosters inclusion, and helps preserve cultural
heritage. By giving minority-language users access to essential services in their own
languages, it supports their participation in society and improves their quality of
life. The study lays a solid foundation for future advances in speech-to-text and
translation tools that promote language accessibility and cultural preservation.