dc.description.abstract |
Natural language processing (NLP) is the set of techniques by which computers process
human language. Part-of-speech (POS) tagging is one of the fundamental requirements
for some NLP applications. It is considered a solved problem for some other languages,
such as English and Chinese, where taggers reach high accuracy (97%), but it remains
an unsolved problem for Bangla because of the language's ambiguity. Although building
a POS tagger for Bangla is not new work, each of the available POS taggers has its own
limitations. We chose to develop an unsupervised system rather than a supervised one,
because a supervised system needs a large amount of training data and the resources
available for Bangla are very limited. Here we develop a POS tagger based mainly on
Bangla grammar, especially suffixes, because Bangla is a highly inflectional language
in which a single word has many variants depending on its suffixes.
In this POS tagger, we assign 8 base POS tags. Rules based on Bangla grammar and
suffixes are applied, together with a verb root dataset, to identify the POS tags.
To handle words without suffixes, a dataset of almost 14,500 Bangla words with their
default POS tags is added to the system, which helps to increase the efficiency of
the tagger. A modified version of a previously used suffix-analysis algorithm is
applied, which results in a satisfactory accuracy of about 94.2%. |
en_US |