AI-Based Real-time Voice to Text and Emotion system to enhance communication for Deaf individuals

Athoi, Farah Ulfat

dc.contributor.author	Athoi, Farah Ulfat
dc.date.accessioned	2026-04-25T09:22:11Z
dc.date.available	2026-04-25T09:22:11Z
dc.date.issued	2025-12-27
dc.identifier.citation	SWT	en_US
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/17022
dc.description	Thesis Report	en_US
dc.description.abstract	This research presents an integrated Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) system for the Bangla language, intended to make communication easier for hearing-impaired users. Although ASR and SER technologies are advancing rapidly worldwide, Bangla still lacks reliable datasets, emotion recognition models, and real-time subtitle systems. To address this gap, I collected 1,400 audio samples: 600 Normal, 400 Angry, and 400 Sad. The audio was cleaned using Voice Activity Detection (VAD), noise reduction, and trimming. The ASR component uses the Whisper model. Initially, the Word Error Rate (WER) was 58.8% and the Character Error Rate (CER) was 28.2%. After cleaning and preprocessing the data, both WER and CER decreased significantly, improving the system's transcription quality. For the SER component, three models were tested: LSTM, Random Forest, and SVC. The SVC model achieved the highest accuracy, 96.09%, which indicates that among the models tested it provides the most stable and effective emotion recognition for Bangla. The system is designed to convert the speaker's voice into written text in real time and to classify the speaker's emotion into one of three categories: Normal, Angry, or Sad. In practical deployment, noisy environments and multilingual support may pose challenges; however, the initial results and the data preprocessing techniques applied suggest significant potential to improve the system's performance and accuracy.	en_US
dc.description.sponsorship	DIU	en_US
dc.language.iso	en_US	en_US
dc.publisher	Daffodil International University	en_US
dc.subject	Speech-to-Text Recognition	en_US
dc.subject	Emotion Detection	en_US
dc.subject	Assistive Technology	en_US
dc.subject	Real-Time Processing	en_US
dc.subject	Deaf Communication	en_US
dc.title	AI-Based Real-time Voice to Text and Emotion system to enhance communication for Deaf individuals	en_US
dc.type	Thesis	en_US
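
The abstract reports transcription quality as Word Error Rate (WER) and Character Error Rate (CER). As a minimal sketch of how these two metrics are conventionally computed (not the thesis's own evaluation code), both can be expressed as a Levenshtein edit distance divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and the sketch assumes whitespace tokenization for words and Unicode code points for characters, so Bangla vowel signs are counted as separate characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # assumption: each Unicode code point counts as one character
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")  # one substitution + one deletion over 6 words
    print(f"CER = {cer('abcd', 'abed'):.3f}")  # one substitution over 4 characters
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference. Established packages also provide these metrics, and text normalization before scoring (punctuation, numerals, Unicode normalization) can change the reported values noticeably.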