Abstract:
Document classification is a long-standing problem in information retrieval, and it plays an important role in a variety of applications for the effective management of text and large volumes of unstructured data. Automatic document classification can be defined as the content-based assignment of predefined categories to documents, which makes it easier to fetch the relevant data at the right time and to filter and route documents directly to users. To retrieve data with minimal delay, researchers around the world have been building content-based classifiers, and a variety of classification frameworks have been developed. However, none of these classification methods is sufficiently effective, because they rely on conventional algorithms. This paper proposes the Soft Cosine Measure as a content-based classification method. The method takes the similarity between features in a vector space model into account, rather than treating the features as independent or completely different, as existing traditional frameworks do. For example, the proposed method treats ‘emperor’ and ‘king’ as the same word, whereas the other systems treat them as two different words. In addition, both Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting are used to train the system, which achieves a classification accuracy of up to 98.06%.
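To make the idea concrete, the following is a minimal sketch of the soft cosine measure on raw term-frequency vectors. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the helper names (`tf_vector`, `soft_dot`, `soft_cosine`), the four-word vocabulary, and the similarity value 0.9 between ‘emperor’ and ‘king’ are all hypothetical. In practice the feature similarity matrix would be derived from a lexical resource or from word embeddings.

```python
import math
from collections import Counter


def tf_vector(tokens, vocab):
    """Raw term-frequency vector of a tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocab]


def soft_dot(a, b, sim):
    """Similarity-weighted inner product: sum over i, j of sim[i][j] * a[i] * b[j]."""
    n = len(a)
    return sum(sim[i][j] * a[i] * b[j] for i in range(n) for j in range(n))


def soft_cosine(a, b, sim):
    """Soft cosine measure; reduces to the ordinary cosine when sim is the identity."""
    denom = math.sqrt(soft_dot(a, a, sim)) * math.sqrt(soft_dot(b, b, sim))
    return soft_dot(a, b, sim) / denom if denom else 0.0


# Hypothetical toy vocabulary and documents.
vocab = ["emperor", "king", "rules", "banana"]
doc_a = tf_vector(["emperor", "rules"], vocab)
doc_b = tf_vector(["king", "rules"], vocab)

# Identity matrix: every feature is treated as unrelated to every other feature.
identity = [[1.0 if i == j else 0.0 for j in range(len(vocab))]
            for i in range(len(vocab))]

# Assumed similarity matrix: 'emperor' and 'king' are marked as closely related.
related = [row[:] for row in identity]
related[0][1] = related[1][0] = 0.9  # illustrative value, not taken from the paper

hard = soft_cosine(doc_a, doc_b, identity)  # ordinary cosine, about 0.5
soft = soft_cosine(doc_a, doc_b, related)   # soft cosine, about 0.95
print(f"cosine = {hard:.3f}, soft cosine = {soft:.3f}")
```

With independent features the two documents share only the word ‘rules’, so their cosine similarity is about 0.5. Once ‘emperor’ and ‘king’ are marked as related, the soft cosine rises to about 0.95, which is the effect the proposed method relies on. TF-IDF weights could be substituted for the raw counts without changing `soft_cosine`.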