Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning but struggle with hallucinations, domain gaps, and weak modeling of task interdependencies. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, comprising 14,256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned, Chain-of-Thought (CoT)-driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow in which an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies and ground the outputs clinically. Specifically, we propose a panel discussion mechanism that enables task-specific agents to collaborate synergistically and leverage task interdependencies. We further incorporate a retrieval-augmented generation module that enriches agents with surgical knowledge and alleviates the domain gap of general-purpose VLMs. Finally, we design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations, and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems, and outperforms a supervised model by 14.61% in accuracy. The dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git.
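The hierarchical orchestration described above (an orchestrator routing a query to one of two reasoning streams, a specialized agent producing task-level reasoning, and a retrieval step supplying surgical knowledge) can be sketched in a few lines. This is a minimal illustrative sketch only: all class names, stream names, and task names here are assumptions for demonstration and are not the released SurgRAW API, and the retrieval and reasoning steps are stubs standing in for a real knowledge base and a CoT-prompted VLM call.

```python
# Minimal sketch of a hierarchical agentic workflow in the spirit of SurgRAW.
# All identifiers (stream names, task names, methods) are illustrative assumptions.

def retrieve_knowledge(query, knowledge_base):
    """Toy retrieval-augmented step: return entries whose key occurs in the query."""
    return [text for key, text in knowledge_base.items() if key in query]

class TaskAgent:
    """Specialized agent that produces task-level reasoning for one surgical task."""
    def __init__(self, task, knowledge_base):
        self.task = task
        self.kb = knowledge_base

    def reason(self, query):
        context = retrieve_knowledge(query, self.kb)
        # A real agent would run a CoT-prompted VLM here; we return a trace stub.
        return {"task": self.task, "context": context,
                "answer": f"<{self.task} reasoning>"}

class Orchestrator:
    """Divides scene understanding into two reasoning streams and dispatches agents."""
    STREAMS = {  # stream/task names are placeholders, not the paper's exact taxonomy
        "visual_perception": ["instrument_recognition", "action_recognition"],
        "cognitive_inference": ["phase_recognition", "outcome_prediction"],
    }

    def __init__(self, knowledge_base):
        self.agents = {task: TaskAgent(task, knowledge_base)
                       for tasks in self.STREAMS.values() for task in tasks}

    def route(self, task):
        for stream, tasks in self.STREAMS.items():
            if task in tasks:
                return stream
        raise ValueError(f"unknown task: {task}")

    def answer(self, task, query):
        result = self.agents[task].reason(query)
        result["stream"] = self.route(task)
        return result

kb = {"grasper": "A grasper is used to hold and retract tissue."}
orch = Orchestrator(kb)
out = orch.answer("instrument_recognition", "Which instrument (grasper?) is visible?")
print(out["stream"], out["context"])
```

In the full system, the higher-level agents (the panel discussion mechanism and clinical grounding) would consume and cross-check these per-agent traces; the sketch stops at single-agent dispatch to keep the control flow visible.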