Abstract:
Speech Emotion Recognition (SER) is an emerging field in artificial intelligence (AI) and Bengali
speech signal processing that has the potential to improve targeted user interactions and enable
more natural engagement with smart devices. The goal of this work is to improve SER for
Bengali, a language with limited resources in this domain. To that end, a novel
deep learning model (DCNN-BLSTM) is proposed, which aims to improve the accuracy
of emotion recognition by combining 1D-CNN, TDF, and BLSTM networks. In this article, a
deep learning model is trained on the Mel-Frequency Cepstral Coefficients
(MFCCs) of the audio data to create a system that perceives audio signals much as the human
auditory system does. The MFCCs are first extracted from
the audio signal and then passed through local feature learning blocks (LFLBs), which
compute feature values using one-dimensional convolutional neural networks (1D-CNNs).
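A minimal sketch of this MFCC front end, assuming the librosa library; the sample rate and coefficient count are illustrative assumptions rather than values reported here.

```python
# Minimal sketch of the MFCC front end described above, assuming librosa.
# The sample rate (16 kHz) and n_mfcc=40 are illustrative assumptions.
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, n_mfcc=40):
    """Load an utterance and return its MFCC matrix (time_frames x n_mfcc)."""
    y, _ = librosa.load(path, sr=sr)                      # decode audio to a mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # shape: (time_frames, n_mfcc)
```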
Because audio signals are temporal, these feature values are then fed to the
BLSTM layer, which improves temporal learning. The TDF layers ensure that
temporal dynamics are preserved throughout the processing stages, while the Dropout layer
improves model generalization. Lastly, classification and prediction are
carried out by fully connected layers, as sketched below.
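The following Keras sketch illustrates the DCNN-BLSTM pipeline as the abstract describes it. The layer sizes, the number of LFLBs, and the reading of "TDF" as a time-distributed flatten are assumptions made for illustration, not the paper's reported configuration.

```python
# Assumed sketch of the DCNN-BLSTM pipeline: layer sizes, LFLB count, and
# the interpretation of TDF as a time-distributed flatten are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # SUBESCO covers seven emotion classes

def build_dcnn_blstm(time_steps, n_mfcc, num_classes=NUM_CLASSES):
    inputs = layers.Input(shape=(time_steps, n_mfcc))

    # Local feature learning blocks (LFLBs): 1D conv + batch norm + pooling
    x = inputs
    for filters in (64, 128):
        x = layers.Conv1D(filters, kernel_size=5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)

    # TDF layer: keeps the per-frame structure intact for the recurrent stage
    x = layers.TimeDistributed(layers.Flatten())(x)

    # BLSTM captures the temporal dynamics of the learned local features
    x = layers.Bidirectional(layers.LSTM(128))(x)
    x = layers.Dropout(0.3)(x)                 # improves generalization

    # Fully connected layers perform classification and prediction
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

Under these assumptions, compiling with categorical cross-entropy and feeding padded MFCC sequences would complete the training setup.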
Experimental evaluation on the SUBESCO database shows that the BLSTM
effectively captures the features extracted by the 1D-CNN, owing to the
time-series nature of speech signals. Additionally, this study employs five
distinct data augmentation strategies, each of which helps to increase
recognition accuracy (a sketch of representative transforms follows the
abstract). On the SUBESCO dataset, the proposed model achieved a promising
accuracy of 88%. The results indicate that, compared with related studies in
speech emotion recognition, the proposed approach attains higher recognition
rates.
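The abstract does not name the five augmentation strategies; the transforms below (noise injection, pitch shifting, time stretching) are common SER augmentations shown purely as assumptions, using librosa.

```python
# Assumed examples of waveform-level augmentation; the paper's five
# strategies are not named here, so these are representative, not definitive.
import librosa
import numpy as np

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise into the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch by a number of semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def time_stretch(y, rate=1.1):
    """Speed the utterance up or down without changing pitch."""
    return librosa.effects.time_stretch(y=y, rate=rate)
```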