HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers.

翻译：自然语言可通过原始文本提供广泛的监督信号，在开发通用型手术模型中发挥重要作用。这种灵活的监督形式能够使模型具备跨数据集和任务的迁移能力——自然语言既可被用于引用已习得的视觉概念，也可被用于描述新概念。本文提出HecVL，一种用于构建通用型手术模型的新型层次化视频-语言预训练方法。具体而言，我们将手术教学视频与三个层次化的文本进行配对构建层次化视频-文本数据集：在片段级别，利用转录的音频文本生成原子动作描述；在阶段级别，生成概念性文本摘要；在视频级别，生成手术过程的整体摘要文本。随后，我们提出一种新颖的由细到粗的对比学习框架，通过单一模型为三个视频-文本层次学习独立的嵌入空间。通过解耦不同层次级别的嵌入空间，所学的多模态表示能够在同一模型中编码短期与长期的手术概念。得益于注入的文本语义，我们证明HecVL方法可在无需任何人工标注的情况下实现零样本手术阶段识别。此外，我们展示用于手术阶段识别的同一HecVL模型能够迁移至不同手术流程和医疗中心。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日