Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning. It has the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are also projected into the SAM-based decoder to produce segmentation masks. Because the object token embeddings are optimized through both the segmentation and the question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results indicate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.
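The joint optimization of object tokens described above can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not specify the loss functions, the projection, or the decoder internals. The sketch assumes a linear projection of the object token, a dot-product stand-in for the SAM-based mask decoder, a Dice loss for segmentation, a cross-entropy loss for answer prediction, and a hypothetical weighting parameter `seg_weight`. All function names are illustrative.

```python
import math

def project(token, weight):
    """Linearly project an object-token embedding into the decoder's prompt
    space (assumed form; the actual projector is not given in the abstract)."""
    return [sum(w * t for w, t in zip(row, token)) for row in weight]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mask_probs(prompt, pixel_feats):
    """Toy stand-in for a mask decoder: per-pixel foreground probability
    from the dot product of the prompt with each pixel feature."""
    return [sigmoid(sum(p * f for p, f in zip(prompt, feat)))
            for feat in pixel_feats]

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary mask
    (assumed segmentation objective)."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(target) + eps)

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single answer label
    (assumed question answering objective)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def joint_loss(answer_logits, answer_label, probs, target_mask, seg_weight=1.0):
    """Sum of the answering and segmentation losses; seg_weight is a
    hypothetical balancing factor, not taken from the source."""
    return (cross_entropy(answer_logits, answer_label)
            + seg_weight * dice_loss(probs, target_mask))

# Toy example: one object token, two "pixels", two candidate answers.
token = [1.0, -1.0]
weight = [[1.0, 0.0], [0.0, 1.0]]          # identity projection
pixel_feats = [[5.0, 0.0], [0.0, 5.0]]     # pixel 0 aligns with the token
prompt = project(token, weight)
probs = mask_probs(prompt, pixel_feats)
loss = joint_loss([2.0, 0.0], 0, probs, [1, 0])
```

In a real model, gradients of both loss terms would flow back into the same object token embedding, which is the mechanism by which the segmentation objective can shape the representation used for answering.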