Abstract:
This thesis addresses the persistent threat of SQL injection (SQLi) attacks, which remain among the most critical vulnerabilities in web applications despite the widespread use of firewalls and input filters. Such traditional defenses often fail to generalize to previously unseen attack patterns. To address this limitation, we develop and evaluate a machine-learning-based SQLi detection framework designed to be integrated into existing web security mechanisms. Incoming queries are first preprocessed and transformed into TF-IDF feature vectors that capture both benign and malicious query patterns. On these features, we train and compare six supervised classifiers: Logistic Regression, Linear Support Vector Machine, Decision Tree, Random Forest, Complement Naive Bayes, and XGBoost. Models are assessed using ROC-AUC, the area under the precision-recall curve reported as average precision (PR-AP), confusion matrices, and class-wise precision, recall, and F1-score on a validation set of 3,981 samples. All models achieve strong validation performance (ROC-AUC ≥ 99.57%, PR-AP ≥ 98.91%), with Random Forest and Logistic Regression showing particularly high accuracy. Logistic Regression is selected as the primary model because it has the best validation PR-AP (99.90%) and consistently high F1-scores for both classes. On an independent test set of 4,280 requests, the selected model attains a ROC-AUC of 99.97% and a PR-AP of 99.99%. After the decision threshold is optimized using an F2-score constraint and a cost-sensitive objective that heavily penalizes missed attacks, the deployed configuration reaches 99.93% overall accuracy and a macro-F1 of 99.64%, detecting 4,057 of 4,058 SQLi queries and misclassifying only two benign requests as attacks. These results demonstrate that a carefully tuned, interpretable linear model on TF-IDF features can deliver near-perfect SQLi detection performance, offering a practical and easily deployable enhancement to existing web security mechanisms.
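The test-set figures reported above can be cross-checked directly from the stated counts. The following is a minimal sketch, not the evaluation code used in the thesis. It assumes that the 4,280 test requests consist of 4,058 SQLi queries and 222 benign requests (the benign count is inferred as 4,280 − 4,058 and is not stated explicitly), with one missed attack and two benign requests flagged as attacks. The function and variable names are illustrative.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from raw counts; beta=1 gives F1, beta=2 gives F2."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def summarize(total, attacks, attacks_detected, benign_flagged):
    """Derive confusion-matrix counts and summary metrics.

    The attack (SQLi) class is treated as the positive class.
    """
    benign = total - attacks                # inferred, see assumption above
    tp = attacks_detected                   # attacks correctly flagged
    fn = attacks - attacks_detected         # missed attacks
    fp = benign_flagged                     # benign requests flagged as attacks
    tn = benign - benign_flagged            # benign requests passed through

    f1_attack = f_beta(tp, fp, fn)
    # For the benign class the roles swap: TN acts as its true positives,
    # missed attacks are its false positives, false alarms its false negatives.
    f1_benign = f_beta(tn, fn, fp)

    return {
        "accuracy": (tp + tn) / total,
        "f1_attack": f1_attack,
        "f1_benign": f1_benign,
        "macro_f1": (f1_attack + f1_benign) / 2,
        "f2_attack": f_beta(tp, fp, fn, beta=2.0),
    }


metrics = summarize(total=4280, attacks=4058, attacks_detected=4057, benign_flagged=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the sketch yields an accuracy of 99.93% and a macro-F1 of 99.64%, which agrees with the reported values. The macro-F1 is lower than the accuracy because the benign class is small, so its three errors weigh more heavily in the class-averaged score.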