Abstract:
Speech Emotion Recognition (SER) is an emerging field in artificial intelligence (AI) and Bengali
speech signal processing that has the potential to improve targeted user interactions and enable
more natural engagement with smart devices. The goal of this work is to improve SER for
Bengali, a language with limited resources in this domain. To that end, a novel
deep learning model (DCNN-BLSTM) is proposed, which aims to improve the accuracy
of emotion recognition by combining 1D-CNN, TDF, and BLSTM networks. In this article, a
deep learning model is trained on the Mel-Frequency Cepstral Coefficients
(MFCCs) of the audio data to create a system that perceives audio signals much as the human
auditory system does. The MFCCs are first extracted from
the audio signal and then passed through local feature learning blocks (LFLBs), which
compute feature values using one-dimensional convolutional neural networks (1D-CNNs).
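A minimal sketch of this MFCC front end, assuming the librosa library; the sample rate and coefficient count are illustrative assumptions rather than values reported here.

```python
# Minimal sketch of the MFCC front end described above, assuming librosa.
# The sample rate (16 kHz) and n_mfcc=40 are illustrative assumptions.
import librosa
import numpy as np

def extract_mfcc(path, sr=16000, n_mfcc=40):
    """Load an utterance and return its MFCC matrix (time_frames x n_mfcc)."""
    y, _ = librosa.load(path, sr=sr)                      # decode audio to a mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                         # shape: (time_frames, n_mfcc)
```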
Because audio signals are temporal, these feature values are then fed to the
BLSTM layer, which improves temporal learning. The TDF layers ensure that
temporal dynamics are preserved throughout the processing stages, while the Dropout layer
improves model generalization. Lastly, classification and prediction are
carried out by fully connected layers, as sketched below.
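The following Keras sketch illustrates the DCNN-BLSTM pipeline as the abstract describes it. The layer sizes, the number of LFLBs, and the reading of "TDF" as a time-distributed flatten are assumptions made for illustration, not the paper's reported configuration.

```python
# Assumed sketch of the DCNN-BLSTM pipeline: layer sizes, LFLB count, and
# the interpretation of TDF as a time-distributed flatten are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # SUBESCO covers seven emotion classes

def build_dcnn_blstm(time_steps, n_mfcc, num_classes=NUM_CLASSES):
    inputs = layers.Input(shape=(time_steps, n_mfcc))

    # Local feature learning blocks (LFLBs): 1D conv + batch norm + pooling
    x = inputs
    for filters in (64, 128):
        x = layers.Conv1D(filters, kernel_size=5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)

    # TDF layer: keeps the per-frame structure intact for the recurrent stage
    x = layers.TimeDistributed(layers.Flatten())(x)

    # BLSTM captures the temporal dynamics of the learned local features
    x = layers.Bidirectional(layers.LSTM(128))(x)
    x = layers.Dropout(0.3)(x)                 # improves generalization

    # Fully connected layers perform classification and prediction
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

Under these assumptions, compiling with categorical cross-entropy and feeding padded MFCC sequences would complete the training setup.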
Experimental evaluation on the SUBESCO database shows that the BLSTM
effectively captures the features extracted by the 1D-CNN, owing to the
time-series nature of speech signals. Additionally, this study employs five
distinct data augmentation strategies, each of which helps to increase
recognition accuracy (a sketch of representative transforms follows the
abstract). On the SUBESCO dataset, the proposed model achieved a promising
accuracy of 88%. The results indicate that, compared with related studies in
speech emotion recognition, the proposed approach attains higher recognition
rates.
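The abstract does not name the five augmentation strategies; the transforms below (noise injection, pitch shifting, time stretching) are common SER augmentations shown purely as assumptions, using librosa.

```python
# Assumed examples of waveform-level augmentation; the paper's five
# strategies are not named here, so these are representative, not definitive.
import librosa
import numpy as np

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise into the waveform."""
    return y + noise_factor * np.random.randn(len(y))

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch by a number of semitones without changing duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def time_stretch(y, rate=1.1):
    """Speed the utterance up or down without changing pitch."""
    return librosa.effects.time_stretch(y=y, rate=rate)
```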