| dc.description.abstract |
Cervical cancer remains one of the leading causes of cancer-related death among women worldwide, particularly in regions with limited access to healthcare. Although deep learning has shown promise for computer-aided diagnosis, its lack of interpretability and limited use of heterogeneous data types have hindered clinical acceptance. In this paper, we
present an interpretable multimodal hybrid deep learning architecture that
combines Vision Transformers (ViT) with clinical metadata via Large Language
Model (LLM)-assisted fusion. Visual features are learned from Pap smear images
with a ViT, and predictive accuracy is enhanced by incorporating structured clinical inputs such as patient history, HPV status, and cytology scores, weighted through an LLM-based attention mechanism. To handle feature imbalance and improve interpretability, we apply a dual-branch framework in which the image and text streams are joined via a semantic fusion layer that facilitates cross-modal alignment. Transparency is achieved through LIME, saliency maps, and Grad-CAM visualizations that allow clinicians to trace predictions back to the image regions and metadata attributes that drive them. Comprehensive evaluation on
benchmark datasets including SIPaKMeD and Herlev indicates that our approach
achieves higher classification accuracy (99.93%) than standalone CNNs, ViTs, and classical
machine learning models. Furthermore, hybrid models such as ViT-MobileNet and
ViT-BERT generalize better and outperform state-of-the-art methods on both cell-level and patient-level classification tasks. Overall, the system enhances cervical cancer screening by providing a transparent, stable, and clinically viable AI solution for early detection. Future directions include incorporating histopathology data and enabling real-time deployment through web-based diagnostic tools. |
en_US |