Multilabel Movie Genre Classification from Movie Subtitle Using Supervised and Unsupervised Machine Learning Approach

Hasan, Md. Mehedi; Debnath, Susanta Chandra; Hasan, Md. Mozahid

URI: http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/7110

Date: 2021-06-02

Abstract:

Technological breakthroughs and the interest of business entities have made the categorization of media products increasingly common in the digital environment. This is often a multilabel scenario, in which an object may be labeled with several categories. Most of the literature treats movie genre classification as a single-label task, generally based on audio-visual features. This study presents a multilabel movie genre classification model that uses both supervised and unsupervised machine learning techniques to assign movies to their corresponding genres. We created a dataset of English subtitle files for 1200 movies taken from The Movie Database (IMDB), and each movie was labeled according to a set of eleven genre labels. We experimented with two feature extraction methods combined with the classifiers, and applied a feature selection technique to reduce the dimensionality of the feature space. We compared the performance of the unsupervised and supervised techniques on both feature representations using several standard performance measures. Among the unsupervised techniques, K-Means and Bisecting K-Means performed best in terms of cluster quality. Among the supervised techniques, we evaluated KNN, SVM and DT and found that SVM outperformed the other classifiers. Finally, we compared the unsupervised and supervised techniques in terms of cluster quality and observed that the unsupervised K-Means and Bisecting K-Means produced clusters of higher quality than the supervised SVM, DT and KNN. We discuss the reasons for the outliers in the training set and recommend using unsupervised techniques to improve the predefinition of categories and the labeling of textual documents in the training set.
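
As a small illustration of the multilabel setting described in the abstract, the sketch below encodes each movie's genre set as a 0/1 indicator vector and scores predictions with two common multilabel measures, Hamming loss and micro-averaged F1. It is not taken from the report: the genre names, the example movies, and the choice of these two measures are assumptions made for illustration, since the abstract does not list the eleven genres or name the performance measures used.

```python
# Hypothetical genre vocabulary; the abstract does not list the eleven labels.
GENRES = ["action", "comedy", "drama", "romance", "thriller"]


def to_indicator(labels, genres=GENRES):
    """Encode a set of genre names as a 0/1 vector, one slot per genre."""
    return [1 if g in labels else 0 for g in genres]


def hamming_loss(y_true, y_pred):
    """Fraction of (movie, genre) slots where prediction and truth differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all genres."""
    tp = fp = fn = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example: three movies, each with one or more true genres.
true_sets = [{"action", "thriller"}, {"comedy", "romance"}, {"drama"}]
pred_sets = [{"action"}, {"comedy", "romance"}, {"drama", "thriller"}]

Y_true = [to_indicator(s) for s in true_sets]
Y_pred = [to_indicator(s) for s in pred_sets]

print("Hamming loss:", hamming_loss(Y_true, Y_pred))
print("Micro F1:", micro_f1(Y_true, Y_pred))
```

Pooling counts over all genres (micro-averaging) weights frequent genres more heavily; a per-genre (macro) average would be an equally reasonable choice when rare genres matter.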
