Abstract:
Sound or audio classification is a challenging task, since collecting a suitable dataset is not always easy, and even after collection there is no guarantee that any particular machine learning or deep learning model will perform well. In this project, we design a novel approach to classifying musical instruments of different categories using a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). Acoustic scene sounds, music, and speech are all commonly treated within the audio domain, since they share the same digital signal processing (DSP) techniques. Image classification has recently advanced very rapidly through the application of different machine learning models, so it is high time to study how such extensible models perform in audio classification. However, collecting audio data is not always feasible, and training a network on a small dataset while achieving high accuracy is a challenging task for deep learning approaches. We took on that challenge and successfully built two models, one convolutional and one recurrent. To classify instrumental sounds, we first extract features from the audio samples using Mel Frequency Cepstral Coefficients (MFCCs), one of the most widely used feature extraction techniques. Because MFCCs model the human auditory system, they provide highly characteristic features from audio and music samples. Our models achieved accuracies of around 95.76% and 87.62% for the convolutional and recurrent networks, respectively, on 10 different musical instrument classes. Applying the same techniques, we also attempted to classify human emotion from speech. Humans can easily recognize others' emotions by hearing their voices, but for machines this remains a very difficult task, so we analyzed different speaker discrimination and speech analysis techniques to find efficient algorithms for it.
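As a minimal illustration of the MFCC feature extraction step mentioned above, the following Python sketch uses the librosa library (an assumption; the paper does not name its tooling). The file name, sample rate, coefficient count, and mean-pooling step are all hypothetical choices for illustration, not values taken from this work.

    # Minimal sketch of MFCC extraction with librosa (assumed available).
    # "sample.wav", sr=22050, and n_mfcc=13 are illustrative, not from the paper.
    import librosa
    import numpy as np

    # Load a hypothetical audio sample at a fixed sampling rate.
    signal, sr = librosa.load("sample.wav", sr=22050)

    # Compute 13 MFCCs per frame; result has shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # Average over time to get one fixed-length feature vector per clip,
    # a common simple pooling strategy before feeding a classifier.
    features = np.mean(mfcc.T, axis=0)
    print(features.shape)  # (13,)

A fixed-length vector like this can be fed to a dense classifier directly, while the full (n_mfcc, n_frames) matrix is better suited as 2-D input to a CNN or as a frame sequence for an RNN.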