Abstract:
This paper analyzes extractive summarization of news articles, in which Natural Language
Processing (NLP) techniques select the most important sentences of a text to form its
summary. Approximately two
thousand articles covering a wide range of topics, including business, entertainment,
politics, sports, and technology, were gathered from different online platforms, including
the well-known "Prothom Alo" newspaper. Preprocessing consisted of removing punctuation
and special characters and correcting spelling with TextBlob. The primary focus of my
study is the implementation of the TextRank
algorithm, an adaptation of the PageRank algorithm to natural language text. Using this
technique, text was represented as a graph, with vertices representing the sentences and
edges weighted by the cosine similarity between them. I describe my process for
vectorizing sentences and building a similarity matrix by computing the cosine similarity
between each pair of sentences. The paper explores the algorithmic nuances of
using a customized sentence similarity function to rank sentences according to their
relevance and importance. I then conducted a comparative analysis of the generated summaries
against the original texts, calculating similarity scores to evaluate the efficacy of
my summarization process. The study aims to highlight the effectiveness of extractive
summarization in processing large volumes of news data, offering insights into the
potential of NLP in media analytics. By comparing the actual summaries with those
generated through my method, I draw conclusions about the precision and utility of
extractive summarization in the context of diverse news content. This research contributes
to the field by demonstrating a practical application of NLP in the efficient processing and
summarization of large-scale news data.
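
The pipeline outlined above (vectorize sentences, build a cosine similarity matrix, rank sentences with a PageRank-style iteration, keep the top-ranked ones) can be illustrated with a minimal sketch. This is not the implementation used in the study. The regex sentence splitter, the bag-of-words vectors, the damping factor of 0.85, and all function names are assumptions made for illustration, and the customized sentence similarity function mentioned above is replaced here by plain cosine similarity over term counts.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(sentence):
    # Bag-of-words term counts over lowercase alphanumeric tokens (assumption).
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences):
    # matrix[i][j] is the edge weight between sentences i and j (no self-loops).
    vectors = [vectorize(s) for s in sentences]
    n = len(sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = cosine_similarity(vectors[i], vectors[j])
    return matrix


def textrank_scores(matrix, damping=0.85, iterations=100, tol=1e-6):
    # Weighted PageRank by power iteration over the sentence graph.
    n = len(matrix)
    if n == 0:
        return []
    out_weight = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                matrix[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def summarize(text, num_sentences=2):
    # Keep the highest-scoring sentences, restored to their original order.
    sentences = split_sentences(text)
    scores = textrank_scores(similarity_matrix(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(top[:num_sentences]))
```

Calling summarize(article_text, num_sentences=3) on a preprocessed article returns its three highest-ranked sentences in their original order. A sentence that shares no vocabulary with the rest of the article receives only the baseline score and is therefore unlikely to be selected.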