Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, insufficient procedural diversity, and inconsistent evaluation, and they often lack a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark covering 16 downstream datasets and four clinical domains with consistent data splits. In extensive comparisons against self-supervised learning (SSL) baselines and vision-language models (VLMs), SurgRec consistently achieves superior performance on the downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation on which the community can build more general surgical video models. All code, models, and data will be publicly released.