Improving Bangla Hate Speech Detection: An Ensemble Machine Learning Approach

Tarafde, Sonjoy Kumar

URI: http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/14671

Date: 2024-07-13

Abstract:

Online hate speech often targets individuals on the basis of aspects of their identity, such as race, ethnicity, gender, sexual orientation, religion, nationality, or disability. In Bangladesh, such messages are commonly spread through Facebook and YouTube, two of the most popular social media sites in the country. One significant issue is the promotion of hate in the comment sections of celebrities' posts. Over the past few years, Bangladesh has seen an increase in suicide attempts and violent incidents motivated by religious beliefs. Comments of this kind therefore need to be filtered out of social media to maintain a healthy environment. This work concentrates on hate speech in Bangla. A few earlier efforts addressed the problem, but they did not meet expectations. The dataset used here contains more than 3,000 comments collected from different social media platforms. My contribution is a model that classifies Bangla comments as "hate speech" or "normal speech" using hybrid machine learning approaches, each of which combines two traditional models: K-Nearest Neighbour (KNN) and Random Forest (RF), Naive Bayes and Decision Tree, Random Forest and Logistic Regression, and Random Forest and SVM. This is referred to as the ensemble method. Through careful evaluation, this process yields the most dependable result for Bangla among the approaches considered. I compare how well each approach performs and select the model that achieves the highest accuracy on the test data.
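
The pairing and selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two models in each pair are combined, so soft voting (averaging the two predicted probabilities) is an assumption here. The function names, the 0.5 threshold, and the toy probabilities are all hypothetical and are not results from the report.

```python
from typing import Dict, List, Sequence, Tuple


def soft_vote(prob_a: Sequence[float], prob_b: Sequence[float],
              threshold: float = 0.5) -> List[int]:
    """Average two models' hate-speech probabilities and threshold the mean.

    Returns 1 for "hate speech" and 0 for "normal speech".
    Soft voting is an assumed combination rule, not the report's stated one.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same comments")
    return [1 if (a + b) / 2 >= threshold else 0
            for a, b in zip(prob_a, prob_b)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of comments whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def best_pair(y_true: Sequence[int],
              model_probs: Dict[str, Sequence[float]],
              pairs: Sequence[Tuple[str, str]]) -> Tuple[Tuple[str, str], float]:
    """Score every two-model ensemble and return the most accurate one."""
    scores = {
        pair: accuracy(y_true, soft_vote(model_probs[pair[0]], model_probs[pair[1]]))
        for pair in pairs
    }
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Toy test labels and per-model probabilities (made up for illustration only).
y_test = [1, 0, 1, 0, 1, 0]
probs = {
    "knn": [0.9, 0.4, 0.3, 0.2, 0.6, 0.7],
    "rf": [0.8, 0.1, 0.6, 0.3, 0.7, 0.2],
    "nb": [0.6, 0.6, 0.4, 0.7, 0.5, 0.4],
    "dt": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
}
candidate_pairs = [("knn", "rf"), ("nb", "dt")]

winner, score = best_pair(y_test, probs, candidate_pairs)
print(f"best ensemble: {winner[0]} + {winner[1]}, accuracy = {score:.3f}")
```

In practice the probabilities would come from classifiers trained on vectorised Bangla comments, and the remaining pairs named in the abstract (Random Forest with Logistic Regression, Random Forest with SVM) would be added to the candidate list in the same way.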