Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem in which the matching cost is a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgery-specific pre-training or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods: +23.7 on StrasBypass70, +4.5 on BernBypass70, +16.5 on Cholec80, and +19.6 on AutoLaparo. These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.
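To make the weighted matching cost concrete, the following is a minimal sketch and not the TASOT implementation. It assumes frame-level visual and text features and per-action prototypes are already available as arrays, uses cosine distance for both cost terms, and combines them with a hypothetical weight `alpha`. For brevity it solves a plain balanced entropic optimal transport problem with Sinkhorn iterations; the temporally consistent unbalanced Gromov-Wasserstein regularization described above is deliberately omitted. All function names, shapes, and parameter values are illustrative assumptions.

```python
import numpy as np


def cosine_cost(X, A):
    """Cost matrix 1 - cosine similarity between rows of X (frames) and rows of A (action prototypes)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    return 1.0 - Xn @ An.T


def combined_cost(C_vis, C_txt, alpha=0.5):
    """Weighted combination of visual and text costs; alpha is a hypothetical mixing weight."""
    return alpha * C_vis + (1.0 - alpha) * C_txt


def sinkhorn(C, eps=0.1, n_iters=200):
    """Balanced entropic OT with uniform marginals (a simplification: no GW term, no unbalanced relaxation)."""
    n, k = C.shape
    a = np.full(n, 1.0 / n)  # uniform mass over frames
    b = np.full(k, 1.0 / k)  # uniform mass over actions
    K = np.exp(-C / eps)     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)    # rescale to match the action marginal
        u = a / (K @ v)      # rescale to match the frame marginal
    return u[:, None] * K * v[None, :]


# Synthetic stand-ins for real features (assumed shapes, random values).
rng = np.random.default_rng(0)
n_frames, n_actions, d_vis, d_txt = 12, 3, 8, 6
vis_feats = rng.normal(size=(n_frames, d_vis))
vis_protos = rng.normal(size=(n_actions, d_vis))
txt_feats = rng.normal(size=(n_frames, d_txt))
txt_protos = rng.normal(size=(n_actions, d_txt))

C = combined_cost(
    cosine_cost(vis_feats, vis_protos),
    cosine_cost(txt_feats, txt_protos),
    alpha=0.7,
)
T = sinkhorn(C)             # soft frame-to-action assignment
labels = T.argmax(axis=1)   # hard label per frame
print(labels)
```

Each row of `T` holds the mass one frame sends to each action, so taking the row-wise argmax gives a per-frame label. Because this sketch has no temporal term, the resulting labels are not encouraged to form contiguous segments, which is the role the abstract assigns to the Gromov-Wasserstein regularization.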
