Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8K QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while also surpassing them on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.