Abstract:
Workplace absenteeism is a crucial factor in the productive and profitable capacity
of a company. Knowledge of employee absenteeism is therefore of primary importance
to an organization across its multiple dimensions, because properly determining
employees’ profiles allows the identification of excess occurrences of certain
morbidities. Early absenteeism research primarily focused
on predicting the characteristics and disease categories of employees that lead to
higher absenteeism at the workplace. However, predicting employees’ absenteeism
time with tree-based machine learning classifiers, and thereby identifying the
factors that should be taken into account to reduce workplace absenteeism, has
yet to be explored. In this thesis, we apply three prominent machine learning
algorithms, namely Decision Tree, Gradient Boosted Tree, and Random Forest, to
predict the absenteeism time of employees and to identify the factors that lead
employees to higher absenteeism at work. We also compare these algorithms to find
the classifier that produces the highest prediction accuracy. We use an existing
dataset from a courier company in Brazil to
predict the absenteeism time of employees. The dataset contains 21 categories of
reason for absence that are attested by the International Classification of Diseases
(ICD) and 7 other categories without an ICD code, which have proved effective in
detecting absenteeism at work. We classified the absenteeism time into four
categories: NOT ABSENT, HOURS, DAYS, and WEEKS. We evaluated model performance in
predicting absenteeism at work using seven evaluation metrics: True Positive,
True Negative, False Positive, False Negative, Sensitivity, Specificity, and
Accuracy. Our comparative analysis found that Gradient Boosted
Tree produced the best result, with an accuracy of 84.46%, whereas Decision Tree
performed worst, with an accuracy of 80.41%. The Random Forest classifier
fell in between, with an accuracy of 82.43%. Using the tree model, we
discovered that a reason for absence classified as a disease attested by the
International Classification of Diseases (ICD) and the transportation expense from
home to work are the top factors behind higher absenteeism at the workplace.
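
As a rough illustration of how the seven evaluation metrics relate to one another in a
four-class setting, the sketch below derives True Positive, True Negative, False
Positive, and False Negative counts for each class in a one-vs-rest fashion and then
computes Sensitivity, Specificity, and overall Accuracy. The function name
one_vs_rest_metrics and the toy labels are assumptions made for this example; they
are not taken from the thesis, and the thesis may aggregate the metrics differently.

```python
from collections import Counter

# The four absenteeism-time categories named in the abstract.
CLASSES = ["NOT ABSENT", "HOURS", "DAYS", "WEEKS"]


def one_vs_rest_metrics(y_true, y_pred, classes=CLASSES):
    """Return (per-class report, overall accuracy).

    Hypothetical helper: each class in turn is treated as "positive"
    and all other classes as "negative".
    """
    pairs = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    total = len(y_true)
    report = {}
    for c in classes:
        tp = pairs[(c, c)]
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        fp = sum(n for (t, p), n in pairs.items() if t != c and p == c)
        tn = total - tp - fn - fp
        report[c] = {
            "TP": tp, "TN": tn, "FP": fp, "FN": fn,
            # Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP).
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
        }
    # Accuracy = correctly classified samples / all samples.
    accuracy = sum(pairs[(c, c)] for c in classes) / total if total else None
    return report, accuracy


# Toy labels invented for illustration only (not data from the thesis).
y_true = ["HOURS", "HOURS", "DAYS", "WEEKS", "NOT ABSENT", "HOURS", "DAYS", "HOURS"]
y_pred = ["HOURS", "DAYS", "DAYS", "DAYS", "NOT ABSENT", "HOURS", "HOURS", "HOURS"]

report, accuracy = one_vs_rest_metrics(y_true, y_pred)
for name, row in report.items():
    print(name, row)
print("accuracy:", accuracy)
```

Treating each class as positive in turn is one common way to extend Sensitivity and
Specificity beyond the binary case; overall Accuracy needs no such adaptation because
it only counts correctly classified samples.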