Abstract:
The spread of offensive content in the internet era has become a major concern, particularly in English-speaking communities. This work focuses on using machine learning methods to detect offensive language in English. A machine learning model is trained on an English dataset containing examples of both offensive and non-offensive material. The dataset is preprocessed and tokenized to produce sequences that can be fed into the learning algorithms, and techniques such as word tokenization and TF-IDF feature extraction are applied to improve performance. Model efficacy is assessed with standard evaluation metrics: accuracy, precision, recall, and F1-score. Several strategies, including data augmentation and language-specific preprocessing, are investigated to address challenges unique to offensive language detection in English. The results demonstrate how effectively machine learning identifies offensive content in English, indicating the technology's potential to address this issue in settings where English is the primary language. Ensemble models such as Bagging, Boosting, and Voting were also employed, with their performance optimized through hyperparameter tuning. The experimental analysis shows that Random Forest and Multi-Layer Perceptron (MLP) models outperformed the other approaches, achieving the highest accuracy rates of 96.04% and 95.865%, respectively, in offensive language detection.
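To make the described pipeline concrete, the sketch below shows one way TF-IDF features can be combined with Random Forest and MLP classifiers in scikit-learn and scored with the listed metrics. The inline corpus, labels, and hyperparameter values are illustrative placeholders only; they are not the dataset or tuned settings used in this work.

```python
# Minimal sketch of the TF-IDF + Random Forest / MLP pipeline outlined above.
# The tiny inline corpus is a stand-in for the English offensive-language dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data: 1 = offensive, 0 = non-offensive (labels are illustrative).
texts = [
    "you are a complete idiot", "what a lovely day today",
    "nobody likes you, loser", "thanks for the helpful answer",
    "shut up, you moron", "great work on the project",
    "go away, you pathetic fool", "see you at the meeting tomorrow",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    # Hyperparameters are example values, not the tuned settings from the paper.
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42),
}

for name, clf in models.items():
    # Word-level TF-IDF features feed each classifier, as described in the abstract.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", zero_division=0
    )
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f} "
          f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same structure extends to the ensemble variants mentioned above: a scikit-learn `BaggingClassifier` or `VotingClassifier` can be dropped into the pipeline in place of the single classifier and tuned with the same evaluation loop.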