Abstract:
Technological advances and commercial interest have made the categorization of media
products increasingly common in the digital environment. This is usually a multilabel
scenario in which an item may be labeled with several categories. Most of the
literature treats movie genre classification as a single-label task, generally based
on audio-visual features. This study presents a
multilabel movie genre classification model using both supervised and unsupervised
machine learning techniques to classify movies into their corresponding genres. We
created a dataset of English subtitle files for 1200 movies taken from The Movie
Database (IMDB), with each movie labeled from a set of eleven genre labels. We
experimented with two feature extraction methods combined with the classifiers, and
with a feature selection technique to reduce the dimensionality of the feature space.
We compared the performance of
unsupervised and supervised techniques using several standard performance measures
under both feature representation methods. We found that the
best performers among the unsupervised techniques were K-means and bisecting K-means
in terms of cluster quality. Among the supervised techniques, we evaluated k-nearest
neighbors (KNN), support vector machine (SVM), and decision tree (DT) classifiers and
found that SVM outperformed the others. Finally, we compared the unsupervised and
supervised techniques in terms of the quality of the resulting clusters and observed
that K-means and bisecting K-means produced clusters of higher quality than SVM, DT,
and KNN. We discuss the causes of outliers in the training set and recommend using
unsupervised techniques to improve the definition of categories and the labeling of
textual documents in the training set.
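
The multilabel setting described above can be illustrated with a minimal sketch. It
assumes a binary indicator encoding of genres and uses Hamming loss as one common
multilabel measure. The abstract does not name the measures actually used, and the
genre list, `encode`, and `hamming_loss` below are hypothetical names introduced only
for illustration.

```python
# Hypothetical genre list: the study uses eleven labels but does not name them here.
GENRES = ["action", "comedy", "drama", "horror", "romance"]


def encode(labels, genres=GENRES):
    """Return a binary indicator vector: 1 for each genre the movie carries."""
    label_set = set(labels)
    unknown = label_set - set(genres)
    if unknown:
        raise ValueError(f"unknown genres: {sorted(unknown)}")
    return [1 if genre in label_set else 0 for genre in genres]


def hamming_loss(y_true, y_pred):
    """Fraction of (movie, genre) slots where prediction and truth disagree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must contain the same number of movies")
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Two movies, each of which may carry several genres at once.
truth = [encode(["action", "drama"]), encode(["comedy"])]
predicted = [encode(["action"]), encode(["comedy", "romance"])]
print(hamming_loss(truth, predicted))  # 2 wrong slots out of 10
```

A single-label formulation would force each movie into exactly one genre, so the
second genre of the first movie could never be predicted or scored.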