Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Recent advances in surgical computer vision have been driven by fully supervised methods that rely primarily on visual data. These methods depend on manually annotated surgical videos to predict a fixed set of object categories, which limits their generalizability to unseen surgical procedures and downstream tasks. In this work, we propose that the surgical video lectures available through open surgical e-learning platforms can provide effective supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in these lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP (Surgical Vision Language Pre-training), for multi-modal representation learning. SurgVLP constructs a new contrastive learning objective that aligns video clip embeddings with their corresponding multiple text embeddings by bringing them together within a joint latent space. To demonstrate the representation capability of the learned joint latent space, we introduce several vision-and-language tasks for surgery, such as text-based video retrieval, temporal activity grounding, and video captioning, as evaluation benchmarks. We further demonstrate that, without using any labeled ground truth, our approach can be employed for traditional vision-only surgical downstream tasks, such as surgical tool, phase, and triplet recognition. The code will be made available at https://github.com/CAMMA-public/SurgVLP
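
As an illustration of what aligning each video clip with several text embeddings in a joint latent space can look like, the following is a minimal sketch of a symmetric InfoNCE-style loss averaged over multiple text views. It is not the SurgVLP objective itself, whose exact form is not given here. The function names, the temperature of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration; each text view stands in for the transcriptions produced by one speech recognition system.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy in which queries[i] should match keys[i] against all other keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def multi_text_contrastive_loss(video_emb, text_views, temperature=0.07):
    """Average a symmetric video-to-text and text-to-video loss over several text views.

    video_emb: list of clip embeddings, one per clip in the batch.
    text_views: list of views; each view holds one text embedding per clip,
    in the same order as video_emb (hypothetical layout).
    """
    video = [normalize(v) for v in video_emb]
    loss = 0.0
    for view in text_views:
        text = [normalize(t) for t in view]
        v2t = info_nce(video, text, temperature)
        t2v = info_nce(text, video, temperature)
        loss += 0.5 * (v2t + t2v)
    return loss / len(text_views)


# Toy batch: two clips, each paired with text from two hypothetical transcription sources.
video = [[1.0, 0.0], [0.0, 1.0]]
text_a = [[0.9, 0.1], [0.1, 0.9]]
text_b = [[1.0, 0.2], [0.2, 1.0]]

aligned = multi_text_contrastive_loss(video, [text_a, text_b])
# Swapping the texts within each view pairs every clip with the wrong sentence.
shuffled = multi_text_contrastive_loss(video, [text_a[::-1], text_b[::-1]])
print(aligned < shuffled)
```

In this sketch, correctly paired clips and texts yield a lower loss than mismatched ones, which is the property a contrastive objective relies on to pull matching pairs together in the shared space.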
