SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) hold promise for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, they enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and they better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy (Perception, Assessment, and Reasoning) spanning 11 tasks, from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA. The project page is available at: https://camma-public.github.io/SurgTEMP/
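
To make the idea of query-guided token selection concrete, the following is a minimal illustrative sketch, not the SurgTEMP implementation. It assumes that visual tokens and the text query live in a shared embedding space, that relevance is scored by cosine similarity, and that the "spatial" bank keeps the top-k tokens per frame while the "temporal" bank keeps the top-k frames. The function names (`cosine`, `select_tokens`, `build_memory`), the scoring rule, and the bank definitions are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_tokens(frame_tokens, query, k):
    """Return indices of the k tokens most similar to the query, in original order."""
    ranked = sorted(range(len(frame_tokens)),
                    key=lambda i: cosine(frame_tokens[i], query),
                    reverse=True)[:k]
    return sorted(ranked)

def build_memory(video, query, k_spatial, k_temporal):
    """Build two hypothetical memory banks from a video.

    video: list of frames, each a non-empty list of token vectors.
    Returns (spatial_bank, temporal_bank):
      spatial_bank[t] lists (frame_index, token_index) pairs kept for frame t;
      temporal_bank lists the indices of the k_temporal most relevant frames.
    """
    spatial_bank, frame_scores = [], []
    for t, tokens in enumerate(video):
        kept = select_tokens(tokens, query, k_spatial)
        spatial_bank.append([(t, i) for i in kept])
        # Assumption: a frame's relevance is its best token-query similarity.
        frame_scores.append(max(cosine(tok, query) for tok in tokens))
    top_frames = sorted(range(len(video)),
                        key=lambda t: frame_scores[t],
                        reverse=True)[:k_temporal]
    return spatial_bank, sorted(top_frames)

# Toy example: 3 frames, 3 two-dimensional tokens each, one query vector.
video = [
    [[1, 0], [0, 1], [1, 1]],
    [[0, 1], [0, 2], [-1, 0]],
    [[2, 0], [1, 0.1], [0, 1]],
]
query = [1, 0]
spatial_bank, temporal_bank = build_memory(video, query, k_spatial=2, k_temporal=2)
```

In this toy setup the memory size is fixed by `k_spatial` and `k_temporal` rather than by the video length, which is one simple way a selection step can handle variable-length input. How the actual module scores, selects, and stores tokens is described in the paper, not here.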
