Abstract:
This thesis studies how well lightweight vision backbones perform surgical phase recognition within a fixed, CLIP-like framework without fine-tuning. Using laparoscopic cholecystectomy videos from the Cholec80 dataset, we sample frames at 1 fps, resize them to 224×224, apply CLIP normalisation, and match them against textual phase prompts built from the standard procedure phase names. The model is a two-encoder architecture in which both the CLIP text tower and the vision backbone (ViT-Tiny, ViT-Small US-CTC7, ConvNeXt-Tiny, or Swin-Transformer-Tiny) are frozen; only a shallow projection head is trained, using a symmetric contrastive (InfoNCE) loss. All backbones are evaluated on image-text retrieval (Recall@1/5/10), zero-shot phase classification (Top-1/5/10), per-class precision/recall/F1, and confusion matrices, as well as on throughput, parameter count, and memory usage. ConvNeXt-Tiny performs best (0.64-0.65), ViT-Small is intermediate (0.56), and ViT-Tiny is worst (0.49). Error analysis in the frozen-encoder setting shows recurring confusions (e.g. Triangle Dissection vs. Gallbladder Dissection / Clipping and Cutting) and consistent mispredictions on Packaging across the ViT models, which point to differences in architectural inductive bias and capacity. Overall, ConvNeXt-Tiny achieves the best accuracy-efficiency trade-off, and Swin-Tiny is a viable alternative where hierarchical spatial context is required. The thesis contributes a controlled comparison of four lightweight frozen backbones in a CLIP-like framework; a head-only training and incremental retraining pipeline intended for low-resource clinical settings; and practical recommendations for selecting compact, deployable backbones for frame-level surgical phase recognition. These results clarify how usable frozen encoders are for multimodal, video-based surgical understanding.
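As an illustration of the head-only training objective described above, the following is a minimal NumPy sketch of a symmetric InfoNCE loss computed on projected frame features and text embeddings. It is not the thesis implementation: the function names, the linear form of the projection head, and the temperature of 0.07 are assumptions made for the example. It also treats each frame and its prompt at the same batch index as the only positive pair, a simplification, since several frames in a batch may share one phase prompt.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def project(frozen_features, weight):
    """Hypothetical shallow head: one linear map applied to frozen backbone features."""
    return frozen_features @ weight

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Mean of the image-to-text and text-to-image contrastive losses.

    Row i of image_emb and row i of text_emb are assumed to form the
    positive pair; every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)
```

In a head-only setup of this kind, only the projection weights would receive gradient updates, while the backbone features and text embeddings could be computed once and cached, which is what keeps training inexpensive.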