Abstract:
Multimodal learning systems increasingly rely on pretrained text encoders to condition visual representations, yet the behavioral implications of text encoder size and depth under fixed-prompt, low-data conditions remain poorly understood. This paper examines the behavior of compact transformer encoders in a fixed-prompt multimodal learning system, using surgical phase recognition as a controlled evaluation task with seven phases: Preparation, Calot Triangle Dissection, Clipping Cutting, Gallbladder Dissection, Gallbladder Packaging, Cleaning Coagulation, and Gallbladder Retraction. We analyze whether compact pretrained text encoders of different sizes (MiniLM-L3, MiniLM-L6, MiniLM-L12, and DistilBERT) can preserve multimodal alignment under textual inflexibility. To isolate the behavior of the text encoder, the vision encoder is frozen, and only the text encoder (with a lightweight 512-dimensional projection head) and a trainable temperature are learned under a symmetric contrastive (InfoNCE) objective. Experiments use the Cholec80 dataset (80 videos) with frames sampled at 1 FPS, fixed phase prompts, light standardized pre-processing, and video-wise train/validation/test splits. Performance is measured by top-k (Top-1/5/10) accuracy based on image-to-text nearest-prompt classification, and confusion matrix analysis is used to understand per-phase alignment behavior. The results show that Top-1 accuracy varies substantially with encoder depth: MiniLM-L3 (approximately 44%), DistilBERT (approximately 39-40%), and MiniLM-L6 (approximately 39%) achieve the highest accuracy, whereas the deeper MiniLM-L12 is much lower (approximately 24-25%). Despite these differences, Top-5 accuracy is strong for all models (approximately 94-96%) and Top-10 accuracy is near-perfect (approximately 100%), indicating that the correct prompt tends to lie among the semantically closest candidates even when it is not ranked first.
Interestingly, deeper encoders appear more sensitive to limited supervision, whereas shallower encoders are more stable under fixed-prompt constraints. These findings indicate that increasing transformer depth does not necessarily improve multimodal alignment in low-data, fixed-prompt settings. Instead, smaller encoders may behave more robustly and reliably, which offers useful empirical guidance for choosing a text encoder in constrained multimodal learning scenarios. The limitations of the study are class imbalance in the dataset, visual overlap between certain surgical phases, and the fact that the vision encoder was not fine-tuned.
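
For readers unfamiliar with the objective and metric named above, the following is a minimal NumPy sketch of a symmetric InfoNCE loss and of top-k nearest-prompt accuracy. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each image in a batch is assumed to be paired with exactly one text embedding, and the temperature is passed as a fixed number rather than learned.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each (N, D) input is a matched pair.

    Assumption: temperature is a fixed scalar here (it is trainable in the paper).
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def top_k_accuracy(image_emb, prompt_emb, labels, k):
    """Fraction of images whose true prompt index is among the k most similar prompts."""
    sims = l2_normalize(image_emb) @ l2_normalize(prompt_emb).T  # (N, P)
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.asarray(labels)[:, None]).any(axis=1)
    return float(hits.mean())
```

A large gap between Top-1 and Top-5 accuracy in such a metric means the true prompt is usually near the top of the similarity ranking but is often outranked by a close neighbor.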