| dc.description.abstract |
Large Language Models (LLMs) have opened new horizons for literary analysis, authorship attribution, and computational linguistics. Yet little research has applied these models to Bangla literature, even though Bangla has a vast literary tradition and most of its works have been digitized. This research aims to close that gap by designing an authorship prediction model for Bangla novels using recent advances in LLMs. A dataset was built by collecting, pre-processing, and segmenting texts from renowned Bangla writers such as Rabindranath Tagore, Kazi Nazrul Islam, and Sarat Chandra Chattopadhyay, and the texts were tokenized to produce suitable input for training transformer models. Several pre-trained LLMs, namely BanglaBERT, mBERT, and XLM-RoBERTa, were fine-tuned for authorship classification after identifying characteristic features of the texts. The trained models were evaluated using accuracy, precision, recall, and F1-score. Analysis of the results indicates that LLMs can effectively identify the distinctive writing styles of individual authors and the subtle differences between them. The models achieved high accuracy and compared favorably with traditional machine-learning techniques. This work offers a feasible approach to author identification in Bangla and shows the potential of LLMs for the digital humanities and for authorship analysis of literary works in low-resource languages. It can pave the way for further research on plagiarism detection, literary forensics, and the preservation of Bangla linguistic identity in the domain of AI. |
en_US |