| dc.description.abstract |
Understanding how cancer cells respond to different drugs is emerging as one of the central problems of contemporary precision medicine. With large amounts of gene expression data now available, there is growing potential to apply machine learning to learn more accurate patterns of drug sensitivity. In this thesis, I investigate how well basal gene expression patterns predict drug response, primarily IC50 values, using a collection of supervised ML models. The work relies on the GDSC (Genomics of Drug Sensitivity in Cancer) data, which offers extensive information on gene expression and on the sensitivity of tumor cells to a wide range of drugs. My process involves preprocessing the high-dimensional gene expression data, matching it correctly with the drug response labels, and then training several models (Random Forest, XGBoost, and MLP) to see which performs best. Along the way, I also assess the impact of feature scaling, data sampling, and hyperparameter tuning in order to understand how each step affects the final result. The findings indicate that, although predicting drug response remains highly difficult because of the noise and complexity of the data, some models are consistently more effective than others. Specifically, XGBoost and MLP achieve slightly higher accuracy; however, their overall performance makes clear how challenging it is to model direct gene-to-drug causality from such high-dimensional biological data. Despite these limitations, the work remains useful: it provides a reproducible pipeline for gene expression-based drug prediction tasks, as well as insights into which preprocessing and modeling choices have the most significant impact. 
This thesis also discusses where existing models fall short and how these strategies may be refined in future work: feature selection, more elaborate neural architectures, or the inclusion of other omics data may improve the models. Overall, the project demonstrates the application of machine learning to drug response prediction in a practical, hands-on way and highlights the real difficulties that researchers encounter when working with biological data. |
en_US |