Abstract:
Effective information retrieval and organization have become increasingly important as digital
content continues to grow, especially in contexts involving diverse cultural backgrounds. This
paper addresses the clustering of Bengali news using the K-means algorithm combined with latent
semantic analysis (LSA). Because it has rarely been attempted, clustering news based on latent
semantic analysis remains a challenging problem. Document clustering, also known as text
document clustering, is one form of cluster analysis. Recent research
has focused on applying text clustering techniques in diverse domains, including the extraction
of large quantities of valuable content from the Internet and automated document organization
[15], [16]. This article introduces an improved K-means-based framework for clustering text or
news documents. A self-taught (unsupervised) learning model is employed to partition a given set
of data into distinct groups, obviating the need for external labels or identifiers. We analyzed
a dataset consisting of
approximately 0.5 million (504,266) news portal texts retrieved from several Bengali newspapers,
covering seven distinct categories of news content. To cluster the dataset using semantic
analysis, we first preprocess it. The punctuation and keywords are then converted into numerical
codes so that deep learning techniques can be applied to them during training. Once the learned
groups are available, we cluster them using K-means. However, some issues remain to be
addressed, such as data processing and the separation of sentences and punctuation.
We recommend a neural network-based deep learning strategy that can address such issues. Since,
to our knowledge, no groundbreaking work has yet been done on news text or document clustering,
this is an effective approach. We have also conducted a few experiments that show how the
approach is implemented and that support the proposed method's efficacy.
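
As a rough illustration of the pipeline summarized above, the following minimal sketch combines LSA with K-means using scikit-learn. It is not the implementation used in this paper: TF-IDF weighting stands in for the deep learning-based encoding step, the function name cluster_documents, the whitespace tokenizer, and the parameter values are assumptions made for illustration, and the four short Bengali sentences are invented toy examples rather than samples from the dataset.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_documents(docs, n_clusters, n_components=2, seed=0):
    """Return one cluster label per document (TF-IDF -> LSA -> K-means)."""
    # Assumption: split on whitespace so Bengali words with combining
    # vowel signs stay whole (the default token pattern can break them).
    tfidf = TfidfVectorizer(token_pattern=r"\S+")
    # LSA: truncated SVD of the TF-IDF matrix, followed by L2 normalization
    # so that Euclidean K-means behaves like cosine-based clustering.
    lsa = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    features = lsa.fit_transform(tfidf.fit_transform(docs))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(features)


# Invented toy documents: two about sports, two about the economy.
docs = [
    "ক্রিকেট খেলা দল জয়",
    "ক্রিকেট দল ম্যাচ জয়",
    "ব্যাংক সুদ অর্থনীতি বাজার",
    "শেয়ার বাজার ব্যাংক অর্থনীতি",
]
labels = cluster_documents(docs, n_clusters=2)
print(labels)
```

The cluster ids themselves are arbitrary; only the grouping is meaningful. For a corpus of the size described here, n_components would typically be set much higher (on the order of a hundred or more) and n_clusters would match the number of news categories, though both values would need to be tuned experimentally.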