UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei

While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives, which waste model capacity on low-level visual details (such as smoke, specular reflections, and fluid motion) rather than on the semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
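The abstract does not specify the form of the feature diversity regularizer. As a purely illustrative sketch, one common way to discourage representation collapse is a hinge penalty on the per-dimension standard deviation of the token embeddings, in the style of VICReg. The function name `variance_penalty`, the target of 1.0, and the toy inputs below are assumptions made for illustration and are not taken from UniSurg.

```python
import numpy as np


def variance_penalty(features, target_std=1.0, eps=1e-4):
    """Hinge penalty on per-dimension standard deviation (VICReg-style).

    Hypothetical helper, not UniSurg's actual regularizer.
    features: array of shape (num_tokens, dim).
    Returns a scalar that is large when embedding dimensions have
    near-zero spread (collapse) and small when each dimension has a
    standard deviation of at least target_std.
    """
    std = np.sqrt(features.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


# Toy comparison: well-spread embeddings versus fully collapsed ones.
rng = np.random.default_rng(0)
diverse = rng.normal(size=(256, 32))   # each dimension has std close to 1
collapsed = np.ones((256, 32))         # every token maps to the same vector

penalty_diverse = variance_penalty(diverse)
penalty_collapsed = variance_penalty(collapsed)
```

In this sketch the collapsed embeddings receive a penalty close to the target standard deviation, while the well-spread embeddings receive a penalty near zero, so adding such a term to a latent prediction loss would push the encoder away from constant outputs.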
