| dc.description.abstract |
Cervical cancer remains one of the leading causes of cancer-related death among women worldwide, particularly in regions with limited access to healthcare. Although deep learning has shown promise for computer-aided diagnosis, its lack of interpretability and limited use of heterogeneous data types have hindered clinical acceptance. In this paper, we
present an interpretable multimodal hybrid deep learning architecture that
combines Vision Transformers (ViT) with clinical metadata via Large Language
Model (LLM)-assisted fusion. Visual features are learned from Pap smear images
with a ViT, and predictive accuracy is enhanced by incorporating structured clinical inputs such as patient history, HPV status, and cytology scores, weighted through an LLM-based attention mechanism. To handle feature imbalance and improve interpretability, we apply a dual-branch framework in which the image and text streams are joined via a semantic fusion layer that facilitates cross-modal alignment. Transparency is achieved through LIME, saliency maps, and Grad-CAM visualizations that allow clinicians to trace predictions back to the image regions and metadata attributes that drive them. Comprehensive evaluation on
benchmark datasets including SIPaKMeD and Herlev indicates that our approach
achieves higher classification accuracy (99.93%) than standalone CNNs, ViTs, and classical
machine learning models. Furthermore, hybrid models such as ViT-MobileNet and
ViT-BERT generalize better and outperform state-of-the-art methods on both cell-level and patient-level classification tasks. Overall, the system enhances cervical cancer screening by providing a transparent, stable, and clinically viable AI solution for early detection. Future directions include incorporating histopathology data and enabling real-time deployment through web-based diagnostic tools. |
en_US |