| dc.description.abstract |
The growth of genomic and proteomic databases has created an enormous annotation gap: the number of uncharacterised protein sequences is increasing far faster than slow and costly experimental classification can keep up with. Closing this gap requires fast, accurate, and scalable computational tools. This thesis addresses the challenge through an in-depth comparative analysis that determines which machine learning model best classifies proteins from sequence-derived features. A balanced dataset of 3,570 protein sequences (1,785 in each class) was prefiltered and fed into a large-scale feature engineering pipeline built with the iFeature library. A total of 15 feature sets, including Amino Acid Composition (AAC), Dipeptide Composition (DPC), Tripeptide Composition (TPC), and a range of autocorrelation and physicochemical descriptors, were concatenated to yield an 11,466-dimensional feature vector for each protein. Nine machine learning and deep learning models were trained and extensively evaluated on this high-dimensional data: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), a custom Ensemble (RF+XGB+DT), Artificial Neural Network (ANN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). All models were evaluated on a held-out test set. The Random Forest (RF) model was by far the best classifier, performing well on all the essential measures, with an accuracy of 0.8163, a sensitivity of 0.8400, and an AUC of 0.8350. Notably, the more complex boosting (XGB, LGBM) and deep learning (ANN, CNN) models performed worse, while the RNN architecture was unable to identify meaningful patterns in the fixed-length feature vector.
This work provides a concise, empirical reference point for this high-dimensional bioinformatics task and concludes that the Random Forest classifier is the strongest and most efficient model for this particular feature-based approach. This finding offers practical guidance to researchers building similar protein classification pipelines. |
en_US |