DSpace Repository

Identification Of Hormone Binding Proteins Using Multi-Informative Features Incorporating Ensemble Learning Approach

dc.contributor.author Punno, Khadiza Islam
dc.date.accessioned 2026-04-25T09:35:17Z
dc.date.available 2026-04-25T09:35:17Z
dc.date.issued 2025-12-30
dc.identifier.citation SWT en_US
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/17035
dc.description Thesis Report en_US
dc.description.abstract The growth of genomic and proteomic databases has created a large annotation gap: the number of uncharacterised protein sequences has risen far faster than slow, costly experimental classification can keep up with. Closing this gap requires fast, accurate, and scalable computational tools. This thesis addresses the challenge through an in-depth comparative analysis to determine the best machine learning model for classifying proteins from sequence-derived features. A balanced set of 3,570 protein sequences (1,785 in each class) was prefiltered and fed into a large-scale feature engineering pipeline built on the iFeature library. A total of 15 feature sets, including Amino Acid Composition (AAC), Dipeptide Composition (DPC), Tripeptide Composition (TPC), and a range of autocorrelation and physicochemical descriptors, were concatenated to yield 11,466 dimensions per protein. Nine machine learning and deep learning models were trained and extensively tested on this high-dimensional data: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), a custom Ensemble (RF+XGB+DT), Artificial Neural Network (ANN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). The models were evaluated on a held-out test set. The findings showed that the Random Forest (RF) model was by far the best classifier, performing strongly on all the essential measures, with an Accuracy of 0.8163, a Sensitivity of 0.8400, and an AUC of 0.8350. Notably, the more complex boosting (XGB, LGBM) and deep learning (ANN, CNN) models performed worse, and the RNN architecture was unable to identify meaningful patterns in the fixed feature vector.
This work provides a concise empirical reference point for this high-dimensional bioinformatics task and concludes that the Random Forest classifier is the strongest and most efficient model for this particular feature-based approach. This observation offers a practical suggestion for researchers building such protein classification pipelines. en_US
dc.description.sponsorship DIU en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject Feature Engineering en_US
dc.subject Hormone Binding Proteins en_US
dc.subject Protein Function Prediction en_US
dc.subject Ensemble Learning en_US
dc.subject Bioinformatics en_US
dc.title Identification Of Hormone Binding Proteins Using Multi-Informative Features Incorporating Ensemble Learning Approach en_US
dc.type Thesis en_US
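
The composition features named in the abstract (AAC, DPC, TPC) can be illustrated with a minimal sketch. This is not the thesis code and does not call the iFeature library; it is a plain-Python reimplementation under stated assumptions: a 20-letter standard amino acid alphabet, frequencies normalised by the number of valid windows, and a made-up example sequence. The function names `kmer_composition` and `composition_features` are hypothetical.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Return normalised k-mer frequencies in a fixed alphabetical order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with non-standard residues
            counts[window] += 1
            total += 1
    # Guard against sequences shorter than k (no valid windows).
    return [counts[m] / total if total else 0.0 for m in kmers]


def composition_features(sequence):
    """Concatenate AAC (k=1), DPC (k=2), and TPC (k=3) into one vector."""
    seq = sequence.upper()
    return (
        kmer_composition(seq, 1)
        + kmer_composition(seq, 2)
        + kmer_composition(seq, 3)
    )


# Hypothetical example sequence, not taken from the thesis dataset.
example = "MKTAYIAKQR"
features = composition_features(example)
print(len(features))  # 20 + 400 + 8000 = 8420
```

Under these assumptions the three composition descriptors contribute 20 + 400 + 8,000 = 8,420 values per protein; the rest of the 11,466 dimensions reported in the abstract would come from the other feature sets.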