Abstract:
Early diagnosis of thyroid cancer (TC) is vital for improving patient survival rates and preventing overtreatment. However, medical datasets related to thyroid disease often contain missing values, noise, and class imbalance, which degrade the performance of conventional machine learning models. To address these challenges, we propose a hybrid ensemble model, TCpred_Model, that adopts a stacking approach with Random Forest and XGBoost as base learners and Logistic Regression as the meta-classifier. The dataset was preprocessed with missing-value treatment, label encoding, feature scaling, and class balancing using the Synthetic Minority Oversampling Technique (SMOTE). The data were split 80% for training and 20% for testing, and several baseline models, including Logistic Regression, Random Forest, SVM, and XGBoost, were evaluated. Experimental results show that the proposed TCpred_Model outperformed all baseline models, achieving an accuracy of 0.990126, a precision of 0.998175, a recall of 0.982047, and an F1-score of 0.990045. These results indicate that hybrid ensemble learning performs well on complex, imbalanced medical data and improves diagnostic performance. In addition, the model significantly reduced false negatives, that is, missed cancer cases, which is especially important in clinical diagnosis. We conclude that the proposed TCpred_Model may serve as a dependable decision-support tool for early detection of thyroid cancer and provides a promising basis for further development in AI-supported healthcare.
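
As a quick consistency check on the reported metrics, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition, and the false-negative (miss) rate follows as the complement of recall. The sketch below uses only the figures quoted in the abstract; the variable names are illustrative and not taken from the authors' code.

```python
# Values reported in the abstract for TCpred_Model.
precision = 0.998175
recall = 0.982047

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# The miss rate (false-negative rate) is the complement of recall.
miss_rate = 1 - recall

print(f"F1 = {f1:.6f}")
print(f"miss rate = {miss_rate:.6f}")
```

The recomputed F1 agrees with the reported 0.990045 to six decimal places, and the implied miss rate is roughly 1.8% of actual positive cases.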