Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (open-source datasets, curated in-house clinical collections, and web-sourced data) into a unified schema, with the aim of improving label consistency and data standardization across datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, which provide richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence from recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.
