Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, limited procedural diversity, and inconsistent evaluation, and they often lack a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated in two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark of 16 downstream datasets across four clinical domains with consistent data splits. In extensive comparisons against self-supervised learning (SSL) baselines and vision-language models (VLMs), SurgRec consistently achieves superior performance on the downstream datasets. VLMs, in contrast, prove unreliable for fine-grained temporal recognition, showing both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation on which the community can build more general surgical video models. All code, models, and data will be publicly released.
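The abstract mentions balanced sampling over a multi-source corpus but does not specify the scheme. As a rough illustration only, the sketch below shows one common form of balancing: pick a domain uniformly at random, then pick a video within that domain, so that small domains are not drowned out by large ones. The function name `balanced_sample`, the domain sizes, and the video identifiers are all hypothetical and are not taken from the SurgRec pipeline.

```python
import random
from collections import Counter


def balanced_sample(videos_by_domain, n_samples, seed=0):
    """Draw n_samples (domain, video) pairs.

    Each draw first chooses a domain uniformly at random and then a video
    uniformly within that domain. This is an assumed balancing scheme for
    illustration, not the one used by the paper.
    """
    rng = random.Random(seed)
    domains = sorted(videos_by_domain)  # fixed order for reproducibility
    picks = []
    for _ in range(n_samples):
        domain = rng.choice(domains)
        video = rng.choice(videos_by_domain[domain])
        picks.append((domain, video))
    return picks


# Hypothetical, deliberately imbalanced corpus (sizes are made up).
corpus = {
    "endoscopy": [f"endo_{i}" for i in range(1000)],
    "laparoscopy": [f"lap_{i}" for i in range(200)],
    "cataract": [f"cat_{i}" for i in range(50)],
    "robotic": [f"rob_{i}" for i in range(10)],
}

picks = balanced_sample(corpus, n_samples=4000)
counts = Counter(domain for domain, _ in picks)
print(counts)  # each domain appears roughly equally often
```

Under this scheme each domain contributes about a quarter of the draws regardless of its size, at the cost of revisiting videos from small domains more often. Other schemes, such as temperature-scaled or square-root weighting, trade off between uniform-per-domain and uniform-per-video sampling.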