Thyroid Disease Detection Using Machine Learning

Ahmed, Md Mostakim; Shathy, Shamira Shams

dc.contributor.author	Ahmed, Md Mostakim
dc.contributor.author	Shathy, Shamira Shams
dc.date.accessioned	2026-04-12T09:33:11Z
dc.date.available	2026-04-12T09:33:11Z
dc.date.issued	2025-09-17
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16765
dc.description	Project Report	en_US
dc.description.abstract	Thyroid disorders such as hypothyroidism and hyperthyroidism are challenging to diagnose because their symptoms, including fatigue and weight changes, overlap, and because medical data are often inconsistent. This study applies machine learning to thyroid disease detection using two datasets: the Kaggle Thyroid Disease Dataset (9,172 records, 31 features) and the UCI Thyroid Disease Dataset (2,801 instances, 29 attributes). For the Kaggle dataset, a CatBoost classifier was developed after preprocessing that included data cleaning, zero imputation, one-hot encoding, and SMOTE combined with undersampling to address class imbalance. The optimized CatBoost model, which incorporates L2 regularization and balanced class weights, achieved 98.70% accuracy, 98.79% precision (the share of positive predictions that are correct), and a 97% Area Under the Precision-Recall Curve (AU-PRC) for hyperthyroidism, surpassing prior benchmarks by 2-3%. For the UCI dataset, Decision Tree and Random Forest classifiers were built following median/mode imputation, label encoding, feature scaling, and SMOTE. The Decision Tree performed best, with 99.11% accuracy, 99.12% precision, 99.11% recall, 99.07% F1-score, and 98.53% (±0.36%) cross-validation accuracy, outperforming both Random Forest (98.04% accuracy, 98.44% ±0.14% cross-validation accuracy) and existing studies. Feature importance analysis with SHapley Additive exPlanations (SHAP, a method for interpreting model predictions) identified T3, TT4, T4U, FTI, and TSH as critical predictors, offering transparent insights for clinicians. Limitations include potential dataset biases and the need for real-world validation. These tree-based models demonstrate excellent accuracy and interpretability, which reduces the risk of misdiagnosis and paves the way for ethical deployment in healthcare. SHAP also supports clear and trustworthy clinical decision support.	en_US
dc.description.sponsorship	Daffodil International University	en_US
dc.language.iso	en_US	en_US
dc.publisher	Daffodil International University	en_US
dc.subject	Thyroid Disease Detection	en_US
dc.subject	Hypothyroidism And Hyperthyroidism	en_US
dc.subject	Machine Learning in Healthcare	en_US
dc.subject	CatBoost Classifier	en_US
dc.subject	Thyroid Dataset	en_US
dc.title	Thyroid Disease Detection Using Machine Learning	en_US
dc.type	Other	en_US
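
The abstract reports accuracy, precision, recall, and F1-score for multi-class classifiers but does not say how the per-class scores were averaged. The following is a minimal sketch of those four metrics under the assumption of support-weighted averaging. The function name `weighted_metrics` and the toy labels are hypothetical, invented for illustration, and are not taken from the report or from either dataset.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        # Precision: correct predictions of this class / all predictions of it.
        p = tp / predicted[label] if predicted[label] else 0.0
        # Recall: correct predictions of this class / all true samples of it.
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class share of the data (its support)
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = ["negative"] * 6 + ["hypo"] * 3 + ["hyper"] * 1
y_pred = ["negative"] * 5 + ["hypo"] + ["hypo"] * 3 + ["hyper"]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

Under this averaging, weighted recall always equals accuracy, because each class's recall is multiplied by its share of the samples. That would be consistent with the identical 99.11% accuracy and recall reported for the Decision Tree, although the abstract does not confirm which averaging was used.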