DSpace Repository

Performance of Modern Vision Backbones for Surgical Phase Recognition Without Fine-Tuning


dc.contributor.author Mimi, Maysha Mahjabin
dc.date.accessioned 2026-04-22T05:57:51Z
dc.date.available 2026-04-22T05:57:51Z
dc.date.issued 2025-12-27
dc.identifier.citation SWT en_US
dc.identifier.uri http://dspace.daffodilvarsity.edu.bd:8080/handle/123456789/16983
dc.description Thesis Report en_US
dc.description.abstract This thesis evaluates lightweight vision backbones for surgical phase recognition within a fixed CLIP-like framework, without fine-tuning the encoders. From the laparoscopic cholecystectomy videos of the Cholec80 dataset, we sample frames at 1 fps, resize them to 224x224, apply CLIP normalisation, and match them against textual phase prompts built from the standard phase names. The model is a two-encoder architecture in which both the CLIP text tower and the vision backbone (ViT-Tiny, ViT-Small, ConvNeXt-Tiny, or Swin-Transformer-Tiny) are frozen; only a shallow projection head is trained, using a symmetric contrastive (InfoNCE) loss. All backbones are evaluated on image-text retrieval (Recall@1/5/10), zero-shot phase classification (Top-1/5/10), per-class precision/recall/F1, and confusion matrices, together with throughput, parameter count, and memory usage. ConvNeXt-Tiny performs best (0.64-0.65), ViT-Small is intermediate (0.56), and ViT-Tiny is worst (0.49). Error analysis in the frozen-encoder setting shows confusion between phases (e.g. Triangle Dissection vs. Gallbladder Dissection / Clipping and Cutting) and consistent mispredictions on Packaging across the ViT models, which points to differences in architectural inductive bias and capacity. Overall, ConvNeXt-Tiny offers the best accuracy-efficiency trade-off, and Swin-Tiny can serve as an alternative where hierarchical spatial context is required. The thesis contributes a controlled comparison of four lightweight frozen backbones in a CLIP-style setup, a head-only training and incremental retraining pipeline intended for low-resource clinical settings, and practical recommendations for selecting compact, deployable backbones for frame-level surgical phase recognition. These results clarify how usable modern frozen encoders are for multimodal, video-based surgical understanding. en_US
dc.description.sponsorship DIU en_US
dc.language.iso en_US en_US
dc.publisher Daffodil International University en_US
dc.subject Medical Image Analysis en_US
dc.subject Vision Backbones en_US
dc.subject Surgical Phase Recognition en_US
dc.subject Transfer Learning (No Fine-Tuning) en_US
dc.title Performance of Modern Vision Backbones for Surgical Phase Recognition Without Fine-Tuning en_US
dc.type Thesis en_US
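
The abstract names a symmetric contrastive (InfoNCE) loss and a Recall@K retrieval metric without stating either explicitly. The following is a minimal illustrative sketch of both in plain Python, not the thesis implementation. All function names are hypothetical, the temperature of 0.07 is an assumed value, and the sketch assumes each image embedding in a batch is paired with exactly one text embedding at the same index, which simplifies the setting where many frames share one phase prompt.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; rows are images, columns are texts.

    The temperature value is an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE terms."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def recall_at_k(logits, k):
    """Fraction of rows whose matching column (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(logits):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(logits)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for projected image and text features.
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("loss:", symmetric_info_nce(images, texts))
    print("Recall@1:", recall_at_k(similarity_matrix(images, texts), 1))
```

In a frozen-encoder setup like the one described, only the projection head that produces the image embeddings would receive gradients from this loss; the sketch computes the loss value only and leaves optimisation out.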

