Content-based Document Classification using Soft Cosine Measure

Rana, Md. Shohel

URI: http://hdl.handle.net/123456789/3150

Date: 2018-11

Abstract:

Document classification is a long-standing problem in information retrieval, and it plays an important role in a variety of applications for the effective management of text and large volumes of unstructured data. Automatic document classification can be defined as the content-based assignment of predefined categories to documents, which makes it easier to fetch relevant data at the right time and to filter and route documents directly to users. To retrieve data easily and in minimal time, researchers around the world have been building content-based classifiers, and a variety of classification frameworks have been developed. Nevertheless, none of these classification methods is sufficiently effective, because they rely on conventional algorithms. This paper proposes the Soft Cosine Measure as a content-based classification method. Unlike the existing traditional frameworks, which treat features as independent or completely different, this method takes the similarity between features in a vector space model into account. For example, the proposed method treats ‘emperor’ and ‘king’ as similar words, whereas the other systems treat them as two different words. In addition, both Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting are used to train the system, which achieves a classification accuracy of up to 98.06%.
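
The idea of scoring documents by feature similarity, not only by exact term overlap, can be illustrated with a minimal sketch of the soft cosine measure in its usual form, s(a, b) = (aᵀSb) / (√(aᵀSa) · √(bᵀSb)), where S holds pairwise feature similarities. The three-word vocabulary, the similarity value 0.8 between ‘emperor’ and ‘king’, and the function name `soft_cosine` are illustrative assumptions and are not taken from the report.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine between two term-weight vectors a and b.

    sim[i][j] is an assumed similarity between features i and j
    (1.0 on the diagonal). With an identity matrix this reduces to
    the ordinary cosine similarity.
    """
    def bilinear(x, y):
        # Computes x^T * sim * y.
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(x))
                   for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["emperor", "king", "banana"]
doc_a = [1.0, 0.0, 0.0]  # mentions only "emperor"
doc_b = [0.0, 1.0, 0.0]  # mentions only "king"

# Features treated as independent (ordinary cosine).
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Assumed similarity of 0.8 between "emperor" and "king".
related = [[1.0, 0.8, 0.0],
           [0.8, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

hard = soft_cosine(doc_a, doc_b, identity)
soft = soft_cosine(doc_a, doc_b, related)
```

Under these assumptions the ordinary cosine scores the two documents as unrelated, while the soft cosine score equals the assumed similarity between the two terms. In practice the vectors would carry TF or TF-IDF weights, and the similarity matrix would come from a resource such as word embeddings or a thesaurus; which of these the report uses is not stated in the abstract.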
