Abstract:
RNA-binding proteins (RBPs) are essential to cellular processes such as splicing and translational regulation. Accurate identification of RBPs is important for biological research and drug discovery. The task is challenging because the distinguishing patterns in protein sequences are subtle and because existing machine learning models often overfit and generalize poorly. This research introduces an ensemble learning model that uses Bayesian inference to address these challenges. The model incorporates multiple feature extraction methods, including ProtBert, ESM2, LSA, and graph-based techniques such as Node2Vec. These methods capture complementary characteristics of protein sequences, which is intended to improve prediction accuracy. Bayesian inference then combines the outputs of the individual classifiers (SVM, Random Forest, and Decision Tree) to improve reliability and reduce overfitting. Accuracies reported in prior research indicate the potential of advanced models: ESM2 achieved 91%, CrossBind 89%, and Granular Multiple Kernel Learning 88%. Building on these benchmarks, the proposed ensemble model is expected to outperform existing methods in both accuracy and robustness. This research aims to demonstrate the potential of ensemble learning with Bayesian inference for RBP identification.
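
The abstract does not specify how Bayesian inference combines the classifier outputs. As one possible reading, the sketch below uses Bayesian model averaging: each classifier receives a posterior weight proportional to its likelihood on held-out validation labels under a uniform prior, and the final score is the weighted average of the per-classifier probabilities. The function names, the uniform prior, and all numbers are illustrative assumptions, not the authors' method or results.

```python
import math

def log_likelihood(probs, labels):
    """Bernoulli log-likelihood of binary labels given predicted P(RBP)."""
    eps = 1e-9  # clip to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def posterior_weights(val_probs, val_labels):
    """Posterior over classifiers under a uniform prior (assumption):
    weight_k is proportional to exp(validation log-likelihood of model k)."""
    logliks = [log_likelihood(p, val_labels) for p in val_probs]
    top = max(logliks)  # subtract the max for numerical stability
    unnorm = [math.exp(ll - top) for ll in logliks]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

def combine(weights, test_probs):
    """Posterior-weighted average of per-classifier probabilities."""
    n = len(test_probs[0])
    return [sum(w * probs[i] for w, probs in zip(weights, test_probs))
            for i in range(n)]

# Hypothetical validation outputs (made-up numbers, for illustration only).
val_labels = [1, 0, 1, 0]
val_probs = [
    [0.9, 0.2, 0.8, 0.1],  # stand-in for the SVM
    [0.7, 0.4, 0.6, 0.3],  # stand-in for the Random Forest
    [0.5, 0.5, 0.5, 0.5],  # stand-in for the Decision Tree
]
weights = posterior_weights(val_probs, val_labels)

# Hypothetical probabilities for two unseen proteins, one row per classifier.
test_probs = [[0.95, 0.05], [0.6, 0.4], [0.5, 0.5]]
combined = combine(weights, test_probs)
print(weights, combined)
```

In this toy setup the classifier that fits the validation labels best dominates the average, while an uninformative classifier that always outputs 0.5 receives little weight. With realistic validation set sizes such weights tend to concentrate on a single model, so a tempered likelihood or a stacking approach may be preferable in practice.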