Abstract:
Surgical Phase Recognition (SPR) is essential for intraoperative decision support, minimally invasive surgery training, and operating room workflow. However, correctly identifying phases in laparoscopic cholecystectomy videos is difficult because of subtle visual differences between phases, frequent occlusions, class imbalance, and the lack of large-scale annotated surgical videos. Recent approaches exploit large vision-language models such as Contrastive Language-Image Pretraining (CLIP), whose fusion mechanisms are trained on large general-purpose datasets and may not transfer to small-data, domain-specific medical tasks. To overcome these shortcomings, in the present paper we revisit CLIP-style multimodal fusion under the low-data settings of surgical video analysis by comparing lightweight fusion strategies (additive fusion, concatenation-based multilayer perceptron (MLP) fusion, gated fusion, and shared-projection fusion) on a frozen CLIP-based backbone. Frame-level phase classification experiments on the Cholec80 laparoscopic cholecystectomy dataset indicate that simple fusion works best: additive fusion is the most robust, since its small number of parameters reduces overfitting, and gated fusion shows competitive generalization through adaptive modality integration. In contrast, concatenation-based and shared-projection fusion are more sensitive to data scarcity and exhibit inferior, unstable behavior. An error-flow analysis further reveals persistent confusions between visually similar phases, e.g., Gallbladder Dissection versus Cleaning and Coagulation, which illustrates the limits of visual evidence when data are scarce.
Overall, this work shows that effective multimodal surgical phase recognition in clinical practice calls for reconsidering fusion design rather than further increasing model complexity, and that lightweight fusion schemes are a feasible and practical solution for surgical video analysis in low-resource settings.
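
The four fusion strategies compared in this abstract can be illustrated with a minimal NumPy sketch. It assumes that the frozen CLIP-based backbone has already produced a visual embedding `v` and a text embedding `t` of equal dimension `d`. The function names, the hidden width, the toy sizes, and the weight initialisation are all hypothetical and are not taken from the implementation evaluated here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def additive_fusion(v, t):
    # Element-wise sum of the two embeddings; no trainable parameters.
    return v + t


def concat_mlp_fusion(v, t, w1, b1, w2, b2):
    # Concatenate both embeddings, then apply a two-layer MLP with ReLU.
    x = np.concatenate([v, t], axis=-1)
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2


def gated_fusion(v, t, wg, bg):
    # A sigmoid gate computed from both embeddings mixes them per dimension.
    gate = sigmoid(np.concatenate([v, t], axis=-1) @ wg + bg)
    return gate * v + (1.0 - gate) * t


def shared_projection_fusion(v, t, wp, bp):
    # One linear projection is shared by both modalities before summation.
    return (v @ wp + bp) + (t @ wp + bp)


# Hypothetical toy sizes; real CLIP embeddings are larger (e.g. 512-d for ViT-B/32).
batch, d, hidden_dim = 4, 8, 16
rng = np.random.default_rng(0)
v = rng.standard_normal((batch, d))  # stand-in for frozen visual embeddings
t = rng.standard_normal((batch, d))  # stand-in for frozen text embeddings

# Assumed initialisations, chosen only to make the sketch runnable.
params = {
    "additive": [],
    "concat_mlp": [
        0.1 * rng.standard_normal((2 * d, hidden_dim)), np.zeros(hidden_dim),
        0.1 * rng.standard_normal((hidden_dim, d)), np.zeros(d),
    ],
    # Zero gate weights give gate = 0.5, i.e. an even mix of both modalities.
    "gated": [np.zeros((2 * d, d)), np.zeros(d)],
    "shared_projection": [0.1 * rng.standard_normal((d, d)), np.zeros(d)],
}
fusers = {
    "additive": additive_fusion,
    "concat_mlp": concat_mlp_fusion,
    "gated": gated_fusion,
    "shared_projection": shared_projection_fusion,
}

fused = {name: fusers[name](v, t, *params[name]) for name in fusers}
n_params = {name: sum(p.size for p in params[name]) for name in params}

for name in fusers:
    print(name, fused[name].shape, n_params[name])
```

In this sketch, additive fusion adds no trainable parameters, while the concatenation-based MLP adds the most. That difference is one plausible reading of why the simpler variants overfit less when annotated data are scarce. A classification head over the fused vector, not shown here, would produce the frame-level phase prediction.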