Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models (VLMs) that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs on the fundamental surgical vision task of surgical tool detection. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that, among the evaluated VLMs, Qwen2.5 consistently achieves the best detection performance in both configurations. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.