Abstract:
Online platforms, especially social media, have become an important part of everyday
life, and we have grown accustomed to online content and sites. However, many URLs lead
to fake sites that are intentionally created to mislead users into giving up sensitive
information. This commonly results in account hijacking or information theft. To identify these sites and
stop users from following such URLs, we discuss several machine learning algorithms and
apply them to our own dataset. The dataset was collected from various open-source online
platforms. A total of 20,000 URLs were collected and used, half of which were fake URLs
and the other half real URLs.
First, we extracted a number of features from the initial dataset, which were later used
to train our model. The project was implemented in an Anaconda environment, with the
code written in a Jupyter notebook. We successfully extracted the necessary features and
applied the machine learning algorithms to them. The dataset was divided
into an 80:20 ratio for training and testing. Well-established supervised machine
learning algorithms were chosen to train our model. Of these, the Random Forest
Classifier performed best, achieving the highest accuracy at 97.50%. Finally, the model
was saved for later improvement. With
this, we believe we have found the best-performing machine learning approach, among those
we tested, for detecting fake content or sites online. We hope it will help detect fake
URLs and protect users from such attacks.
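
The pipeline described above (feature extraction, an 80:20 split, a Random Forest
Classifier, and saving the model) can be sketched roughly as follows. This is only an
illustrative sketch, assuming scikit-learn and joblib are installed. The lexical
features, the function names `extract_features` and `train_and_save`, the file name
`url_model.joblib`, and the hyperparameters are all assumptions, since the abstract does
not specify them.

```python
import re
from urllib.parse import urlparse


def extract_features(url: str) -> list:
    """Turn a URL into a small numeric feature vector.

    These seven lexical features are hypothetical examples; the features
    actually used in the study are not listed in the abstract.
    """
    # Add a scheme when missing so urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return [
        len(url),                               # total URL length
        host.count("."),                        # number of dots in the host
        url.count("-"),                         # number of hyphens
        sum(ch.isdigit() for ch in url),        # number of digits
        int("@" in url),                        # contains an '@' symbol
        int(parsed.scheme == "https"),          # uses HTTPS
        int(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None),  # host looks like an IPv4 address
    ]


def train_and_save(urls, labels, model_path="url_model.joblib"):
    """Train a Random Forest on an 80:20 split and save it to disk.

    `labels` is assumed to hold 1 for fake URLs and 0 for real ones.
    """
    # Imported here so extract_features can be used without scikit-learn.
    import joblib
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X = [extract_features(u) for u in urls]
    # 80% training, 20% testing, keeping the fake/real balance in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    joblib.dump(model, model_path)  # saved for later improvement
    return model, accuracy
```

Calling `train_and_save(urls, labels)` with a list of URLs and their 0/1 labels would
return the fitted model and its accuracy on the held-out 20%. The accuracy obtained
depends on the features and data used, so this sketch should not be expected to
reproduce the reported 97.50%.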