Beyond Words: Unraveling Text Complexity with Novel Dataset and a Classifier Application

Islam, Mohammad Shariful; Rony, Mohammad Abu Tareq; Saha, Pritom; Ahammad, Mejbah; Alam, Shah Md Nazmul; Rahman, Md Saifur

dc.contributor.author	Islam, Mohammad Shariful
dc.contributor.author	Rony, Mohammad Abu Tareq
dc.contributor.author	Saha, Pritom
dc.contributor.author	Ahammad, Mejbah
dc.contributor.author	Alam, Shah Md Nazmul
dc.contributor.author	Rahman, Md Saifur
dc.date.accessioned	2024-05-04T06:21:20Z
dc.date.available	2024-05-04T06:21:20Z
dc.date.issued	2023-02-27
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/12216
dc.description.abstract	Text classification is a fundamental task in Natural Language Processing (NLP). This research presents a novel human-annotated dataset of 22,331 English sentences categorized into four classes (simple, complex, compound, complex-compound), together with a sentence classifier tool that analyzes and classifies sentences within English text, with particular relevance to literary writing. The study evaluates classification performance on this dataset using three feature representation methods: Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embedding features. Four machine learning and two deep learning classifier models are evaluated. BoW combined with the Support Vector Classifier (SVC) and with Logistic Regression (LR) achieved high accuracy in distinguishing sentence complexity. The deep learning models built on word embedding features, LSTM and RNN, provide a richer semantic representation. LSTM achieves the highest accuracy, 98.03%, with balanced precision and recall and an average F1-score of 97%. RNN is slightly less accurate at 97.75% but still captures dependencies in sentence structure well. The work offers valuable insights for practical applications and contributes to the broader understanding of sentence structures and semantics.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IEEE	en_US
dc.subject	Classification	en_US
dc.subject	Datasets	en_US
dc.subject	Natural language	en_US
dc.title	Beyond Words: Unraveling Text Complexity with Novel Dataset and a Classifier Application	en_US
dc.type	Article	en_US
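
The abstract names Bag-of-Words and TF-IDF as two of its feature representations. The following is a minimal illustrative sketch of how such features can be computed, not the authors' implementation. The function names, the simple tokenizer, and the example sentences are hypothetical, and the unsmoothed textbook weighting tf * log(N / df) is an assumption; libraries such as scikit-learn use smoothed variants, and the record does not say which formulation the study used.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(sentence):
    # Hypothetical tokenizer: split on whitespace, strip punctuation, lowercase.
    words = (w.strip(PUNCT).lower() for w in sentence.split())
    return [w for w in words if w]

def bow_vectors(sentences):
    # Bag-of-Words: one raw count per vocabulary word, per sentence.
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def tfidf_vectors(sentences):
    # TF-IDF (assumed unsmoothed form): relative term frequency times
    # log(N / document frequency). A word that occurs in every sentence gets weight 0.
    vocab, bow = bow_vectors(sentences)
    n = len(sentences)
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    vectors = []
    for row in bow:
        total = sum(row) or 1  # guard against empty sentences
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, vectors

if __name__ == "__main__":
    # Hypothetical examples of a simple and a compound sentence.
    examples = ["The cat sat.", "The cat sat, and the dog barked."]
    vocab, bow = bow_vectors(examples)
    print(vocab)
    print(bow)
```

Vectors like these would then be passed to a classifier such as SVC or Logistic Regression. Conjunctions like "and" or "because" become explicit features under this representation, which is one plausible reason count-based features can separate the four sentence classes.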