Abstract:
Word embeddings have proven remarkably effective in the field of NLP, and researchers have since proposed several embedding models. Our goal is to identify the best embedding model, which is difficult because an embedding's quality varies with the size and source of the training dataset and with the embedding task at hand. The purpose of this study is to measure how these models perform on different types of embedding tasks. In this paper we examine the performance of the CBOW, skip-gram, and GloVe models; these models embed words by representing them as vectors. We collected 250,000 (2.5 lakh) Bengali newspaper articles from a renowned newspaper of Bangladesh, building a web scraper with Scrapy to gather this large amount of data. On the resulting dataset of 20 million Bengali words, we trained the CBOW and skip-gram architectures for both the word2vec and FastText models, and we used the same dataset to train the GloVe model. Gensim, the FastText library, and a Python library were used to train the three models, respectively. To evaluate the models, we performed several word embedding tasks, namely word analogy and semantic and syntactic word prediction. Surprisingly, FastText performed better than the other models on the semantic and syntactic tasks, while on the analogy task the performance of all models was almost the same, except for GloVe.
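As a rough illustration of the training setup summarized above (a minimal sketch, not the authors' exact code; the corpus file name, tokenization, and hyperparameters are assumptions), the CBOW and skip-gram architectures for word2vec and FastText could be trained with Gensim as follows:

```python
# Minimal sketch: training CBOW and skip-gram word2vec/FastText models with Gensim.
# The corpus path, tokenization, and hyperparameters below are illustrative assumptions,
# not the exact settings used in this paper.
from gensim.models import Word2Vec, FastText

# One whitespace-tokenized Bengali sentence per line.
with open("bengali_news_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# sg=0 selects CBOW, sg=1 selects skip-gram.
w2v_cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=0, workers=4)
w2v_sg   = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=4)
ft_sg    = FastText(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=4)

# A simple semantic check: nearest neighbours of a query word.
print(ft_sg.wv.most_similar("ঢাকা", topn=5))
```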