Abstract:
Thalassemia is a genetic blood disorder inherited from parents. It is one of the most common
and concerning genetic disorders worldwide. Its main manifestations are anemia, ranging from
mild to severe, and transfusion dependence. In South Asian countries such as Bangladesh, many
children are born with thalassemia traits every year. Among the various types of thalassemia,
beta-thalassemia is the most severe; it causes weakness, serious anemia, shortness of breath,
and even failure of organs such as the kidneys and heart. This study aims to classify thalassemia
based on the values of several hemoglobin (Hb) indices, such as Hb A, Hb B, Hb E, and
Hb F, collected from a thalassemia center in Bangladesh. The work also describes the
epidemiological aspects of thalassemia using data from all strata of the general
population of Bangladesh. We applied various machine learning classifiers such as Logistic
Regression (LR), Decision Tree, Support Vector Machine (SVM), Random Forest, and
K-Nearest Neighbors (KNN) to classify thalassemia. To evaluate the performance of
the classifiers, we calculated accuracy, precision, recall, and F1-score. We also plotted the
ROC curves, which show a large area under the curve (AUC). Among all the algorithms,
Random Forest and KNN achieved the best accuracy, 99.14%. Precision and recall are both
99.00% for Random Forest, and 99.00% and 100%, respectively, for KNN.
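
For readers less familiar with the evaluation measures named above, the following is a minimal sketch of how accuracy, precision, recall, F1-score, and AUC can be computed for a binary classifier. It is not the code used in this study: the function names `binary_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions introduced purely for illustration, and the numbers it produces are unrelated to the results reported here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the study's dataset).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(binary_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The sketch also shows why precision and recall can differ for the same classifier, as they do for KNN above: precision is lowered only by false positives, recall only by false negatives, so a model that misses no positive cases reaches 100% recall even if it raises a few false alarms.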