Abstract:
Cervical cancer is one of the deadliest diseases, causing a significant number of premature deaths in underdeveloped countries. Several risk factors contribute to it. Organizations and individual researchers have proposed numerous approaches to the problem, and the use of machine learning classifiers has become common practice in recent years. This study presents a predictive model for classifying cervical cancer stages, alongside traditional machine learning classifiers for comparative analysis. The dataset used is highly imbalanced and has missing values for a number of attributes. Missing-value imputation was applied to handle the incomplete attributes, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to resolve the data imbalance issue. Several feature selection techniques, such as Univariate Feature Selection (UFE) and Recursive Feature Elimination (RFE), were employed to determine the most important attributes for the classification outcomes. To demonstrate the effectiveness of the classifiers, the performance of the Decision Tree Classifier (DTC), Random Forest Classifier (RFC), Logistic Regression Classifier (LRC), Gaussian Naive Bayes (NBC), K Nearest Neighbors (KNNC), Gradient Boosting Classifier (GBC), AdaBoost Classifier (ABC), XGBoost Classifier (XGBC), and Support Vector Classifier (SVC) was compared before and after the application of sampling and feature selection. In the same manner, ensemble methods such as Bagging, Boosting, Stacking, and Voting Classifier were employed with a view to obtaining an improved score. Hyperparameter tuning was used to find the best set of parameters for classification. Overall, the results show a marginal decline in performance after the application of feature selection techniques and a significant improvement with ensemble methods.
RFC achieved the highest accuracy score of 99.60% after employing the feature selection technique (RFE).