Abstract:
Thyroid disorders, such as hypothyroidism and hyperthyroidism, are challenging to
diagnose due to overlapping symptoms like fatigue and weight changes, compounded
by inconsistent medical data. This study leverages machine learning to enhance
thyroid disease detection using two robust datasets: the Kaggle Thyroid Disease
Dataset (9,172 records, 31 features) and the UCI Thyroid Disease Dataset (2,801
instances, 29 attributes). For the Kaggle dataset, a CatBoost classifier was developed
after rigorous preprocessing, including data cleaning, zero imputation, one-hot
encoding, and SMOTE with undersampling to address class imbalance. The optimized
CatBoost model, incorporating L2 regularization and balanced class weights, achieved
98.70% accuracy, 98.79% precision (the proportion of positive predictions that are
correct), and 97% Area Under the Precision-Recall Curve (AU-PRC) for
hyperthyroidism, surpassing
prior benchmarks by 2-3%. For the UCI dataset, Decision Tree and Random Forest
classifiers were built following median/mode imputation, label encoding, feature
scaling, and SMOTE. The Decision Tree excelled with 99.11% accuracy, 99.12%
precision, 99.11% recall, 99.07% F1-score, and 98.53% (±0.36%) cross-validation
accuracy, outperforming Random Forest (98.04% accuracy, 98.44% ±0.14%
cross-validation accuracy) and existing studies. Feature importance, elucidated by Shapley Additive
Explanations (SHAP, a method for interpreting model predictions), identified T3, TT4,
T4U, FTI, and TSH as critical predictors, offering transparent insights for clinicians.
Despite these strengths, limitations include potential dataset biases and the need for
real-world validation. These tree-based models demonstrate excellent accuracy and
interpretability, reducing the risk of misdiagnosis and paving the way for ethical
deployment in healthcare. SHAP further ensures clear and trustworthy clinical decision
support.