Abstract:
This research presents an integrated Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) system for the Bangla language, intended to make communication easier for hearing-impaired users. The system converts a speaker's voice into written text in real time and classifies the speaker's emotion into one of three categories: Normal, Angry, or Sad. Although ASR and SER technologies are advancing rapidly worldwide, Bangla still lacks reliable datasets, emotion recognition models, and real-time subtitle systems. To address this gap, I collected a total of 1,400 audio samples (600 Normal, 400 Angry, and 400 Sad) and cleaned the audio using Voice Activity Detection (VAD), noise reduction, and trimming. The ASR component uses the Whisper model. Initially, the Word Error Rate (WER) was 58.8% and the Character Error Rate (CER) was 28.2%. After cleaning and preprocessing the data, both WER and CER decreased significantly, improving the system's transcription quality. For the SER component, three models were tested: Long Short-Term Memory (LSTM), Random Forest, and Support Vector Classifier (SVC). The SVC model achieved the highest accuracy, 96.09%, which indicates that it provides the most stable and effective emotion recognition for Bangla among the models compared. In practical deployment, noisy environments and multilingual support may still pose challenges. However, the initial results and the data preprocessing techniques applied suggest significant potential to further improve the system's performance and accuracy.
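
For readers unfamiliar with the two transcription metrics reported above, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance (substitutions, insertions, and deletions) between the reference and the hypothesis, divided by the reference length in words or characters. The function names and the example strings are illustrative assumptions, not the evaluation code used in this research. The sketch also assumes that spaces count as characters for CER and that characters are Unicode code points; for Bangla script, a single visible character may span several code points, so a production evaluation may need to normalize or segment text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative (hypothetical) romanized strings, not samples from the dataset.
reference = "ami bhat khai"
hypothesis = "ami bhat khabo"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because both metrics are normalized by the reference length, a single wrong word in a short utterance produces a high WER while the CER stays comparatively low, which is consistent with the CER reported above being lower than the WER.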