Identification Of Hormone Binding Proteins Using Multi-Informative Features Incorporating Ensemble Learning Approach

Punno, Khadiza Islam

dc.contributor.author	Punno, Khadiza Islam
dc.date.accessioned	2026-04-25T09:35:17Z
dc.date.available	2026-04-25T09:35:17Z
dc.date.issued	2025-12-30
dc.identifier.citation	SWT	en_US
dc.identifier.uri	http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/17035
dc.description	Thesis Report	en_US
dc.description.abstract	The growth of genomic and proteomic databases has created an enormous annotation gap: the number of uncharacterised protein sequences is increasing far faster than slow and costly experimental methods can classify them. Closing this gap requires fast, accurate, and scalable computational tools. This thesis addresses the challenge through an in-depth comparative analysis to determine the best machine learning model for classifying proteins from sequence-derived features. A balanced dataset of 3,570 protein sequences (1,785 in each class) was prefiltered and fed into a large-scale feature engineering pipeline built on the iFeature library. A total of 15 feature sets, including Amino Acid Composition (AAC), Dipeptide Composition (DPC), Tripeptide Composition (TPC), and a range of autocorrelation and physicochemical descriptors, were concatenated to yield an 11,466-dimensional feature vector for each protein. Nine machine learning and deep learning models were trained and extensively evaluated on this high-dimensional data: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), a custom Ensemble (RF+XGB+DT), Artificial Neural Network (ANN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). All models were evaluated on a held-out test set. The Random Forest (RF) model was by far the best classifier, performing strongly on all key metrics with an Accuracy of 0.8163, a Sensitivity of 0.8400, and an AUC of 0.8350. Notably, the more complex boosting models (XGB, LGBM) and deep learning models (ANN, CNN) performed worse, and the RNN architecture was unable to identify meaningful patterns in the fixed-length feature vector. 
This work provides a concise empirical reference point for this high-dimensional bioinformatics task and concludes that the Random Forest classifier is the strongest and most efficient model for this particular feature-based approach. This finding offers a useful, practical recommendation for researchers building similar protein classification pipelines.	en_US
dc.description.sponsorship	DIU	en_US
dc.language.iso	en_US	en_US
dc.publisher	Daffodil International University	en_US
dc.subject	Feature Engineering	en_US
dc.subject	Hormone Binding Proteins	en_US
dc.subject	Protein Function Prediction	en_US
dc.subject	Ensemble Learning	en_US
dc.subject	Bioinformatics	en_US
dc.title	Identification Of Hormone Binding Proteins Using Multi-Informative Features Incorporating Ensemble Learning Approach	en_US
dc.type	Thesis	en_US
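
For readers unfamiliar with the composition descriptors named in the abstract, the following is a minimal sketch of how AAC and DPC are commonly defined. It is written in plain Python rather than with the iFeature library, so the function names (`aac`, `dpc`, `featurize`) and the toy sequence are illustrative assumptions and not the thesis's actual pipeline. Under the standard definitions, AAC contributes 20 values and DPC contributes 400 values per sequence.

```python
from itertools import product

# The 20 standard amino acid residues, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order so feature columns align.
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino Acid Composition: fraction of each residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide Composition: fraction of each adjacent pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = len(seq) - 1  # number of overlapping adjacent pairs
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[p] / total for p in DIPEPTIDES]


def featurize(seq):
    """Concatenate AAC and DPC into a single fixed-length vector."""
    return aac(seq) + dpc(seq)


# Hypothetical toy sequence, used only to show the output shape.
example = "MKTAYIAKQR"
vec = featurize(example)
print(len(vec))  # 420 = 20 (AAC) + 400 (DPC)
```

Concatenating further descriptor sets in the same way is how a fixed-length vector such as the 11,466-dimensional one described in the abstract would be assembled; the resulting matrix can then be passed to any classifier that accepts tabular input.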