dc.description.abstract |
In my thesis project, "Lung Cancer Prediction Using Machine Learning Techniques," I
aimed to develop a reliable system for predicting lung cancer risk through the application
of various machine learning algorithms. The dataset utilized was sourced from Kaggle,
originating from an online lung cancer prediction system. It comprised multiple attributes
related to individuals' demographics, lifestyle choices, and health symptoms, with a binary
target variable indicating the presence or absence of lung cancer. Initially, I preprocessed
the dataset, converting certain column values to binary (0 and 1) and addressing missing
values. During exploratory data analysis, I identified an imbalance in the target distribution
and mitigated it using oversampling techniques. Additionally, I performed feature
engineering by eliminating irrelevant features and creating new ones to enhance predictive
capability. To reduce dimensionality, I employed Principal Component Analysis (PCA)
before training several machine learning models, including Logistic Regression, Decision
Tree, K-Nearest Neighbors, Multinomial Naive Bayes, Support Vector Classifier, and
Multi-layer Perceptron classifier. Among these models, Logistic Regression emerged as the top
performer, achieving an accuracy of approximately 95%. Subsequently, I applied Grid Search
to Logistic Regression to optimize its hyperparameters, yielding a tuned accuracy of
94.89%. Although I also experimented with ensemble techniques such as a Voting Classifier,
Logistic Regression consistently outperformed the other models. Finally, I conducted K-Fold
cross-validation to assess model robustness; Logistic Regression achieved the highest
average accuracy, ahead of Decision Tree and Multi-layer Perceptron. In
conclusion, my research highlights Logistic Regression as the most effective model for
lung cancer risk prediction, emphasizing its accuracy and reliability based on the given
dataset and features. |
en_US |
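
The abstract mentions oversampling to correct class imbalance and K-Fold cross-validation to check robustness, but does not say which oversampling method was used. The following is a minimal sketch, assuming plain random oversampling of the minority class, applied only to the training portion of each fold. The helper names (random_oversample, k_fold_indices) and the toy data are hypothetical and are not taken from the thesis or the Kaggle dataset.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    match the size of the largest class (assumed method, not the thesis's)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical toy data: 8 positive cases and 2 negative cases.
rows = [[i, i % 3] for i in range(10)]
labels = [1] * 8 + [0] * 2

fold_class_counts = []
all_test_indices = []
for train_idx, test_idx in k_fold_indices(len(rows), k=5):
    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    # Balance the training portion only; the held-out fold stays untouched.
    _, balanced_labels = random_oversample(train_rows, train_labels)
    fold_class_counts.append(Counter(balanced_labels))
    all_test_indices.extend(test_idx)
```

Balancing inside each fold, rather than before splitting, keeps duplicated rows out of the held-out data, which would otherwise inflate the measured accuracy. Whether the thesis followed this order is not stated in the abstract.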