Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw text. This flexible form of supervision can make a model transferable across datasets and tasks, since natural language can be used to reference learned visual concepts or to describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing surgical lecture videos with text at three hierarchical levels: at the clip level, atomic actions described by transcribed audio; at the phase level, conceptual text summaries; and at the video level, an overall abstract of the surgical procedure. We then propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling the embedding spaces of the different hierarchical levels, the learned multi-modal representations encode both short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that HecVL enables zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.
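
To make the idea of separate per-level embedding spaces and zero-shot phase recognition more concrete, the following is a minimal NumPy sketch, not the HecVL implementation. It assumes a symmetric InfoNCE objective, one linear projection head per hierarchy level on top of shared features, a temperature of 0.07, and a plain sum of the three per-level losses; none of these details are stated in the abstract, and the fine-to-coarse training schedule is not modeled here. The names `HierarchicalHeads`, `info_nce`, and `zero_shot_phase` are hypothetical.

```python
import numpy as np

LEVELS = ("clip", "phase", "video")  # the three text hierarchies named in the abstract


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of video_emb is paired with row i of text_emb.

    Assumption: the abstract only says "contrastive learning"; InfoNCE and the
    temperature value are illustrative choices.
    """
    logits = l2_normalize(video_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


class HierarchicalHeads:
    """Hypothetical single model with one projection head per hierarchy level.

    Shared backbone features go in; each level gets its own embedding space.
    """

    def __init__(self, feat_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = feat_dim ** -0.5
        self.video_proj = {
            lvl: rng.normal(scale=scale, size=(feat_dim, embed_dim)) for lvl in LEVELS
        }
        self.text_proj = {
            lvl: rng.normal(scale=scale, size=(feat_dim, embed_dim)) for lvl in LEVELS
        }

    def loss(self, batches):
        """batches maps level -> (video_feats, text_feats), each of shape (N, feat_dim)."""
        total = 0.0
        for lvl, (video_feats, text_feats) in batches.items():
            total += info_nce(
                video_feats @ self.video_proj[lvl], text_feats @ self.text_proj[lvl]
            )
        return total  # assumption: unweighted sum over levels


def zero_shot_phase(video_emb, prompt_emb):
    """Assign each video embedding to the phase whose text prompt is most similar."""
    sims = l2_normalize(video_emb) @ l2_normalize(prompt_emb).T
    return sims.argmax(axis=1)
```

In this sketch, zero-shot recognition needs no phase labels: each phase is described by a text prompt, the prompts are embedded once, and every video segment is assigned to the nearest prompt by cosine similarity.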