Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
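The abstract does not specify the frequency-based augmentation in detail. A minimal sketch of one common instantiation of this idea is shown below: perturbing the Fourier amplitude spectrum of a frame while keeping its phase, since phase carries most of the structural (semantic) content and amplitude carries style-like, domain-specific statistics. The function name and the `alpha` strength parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def frequency_augment(image, alpha=0.3, rng=None):
    """Perturb the Fourier amplitude spectrum of an image while keeping
    its phase, leaving semantic structure largely intact.

    `alpha` is an assumed hyperparameter controlling perturbation strength;
    alpha=0 returns the input unchanged (up to numerical error).
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fft2(image, axes=(0, 1))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Randomly rescale the amplitude per frequency: domain-specific cues
    # (color and texture statistics) live mostly in the amplitude spectrum.
    noise = 1.0 + alpha * rng.uniform(-1.0, 1.0, size=amplitude.shape)
    augmented = np.fft.ifft2(amplitude * noise * np.exp(1j * phase),
                             axes=(0, 1)).real
    return augmented.astype(image.dtype)

# Example: augment a random grayscale frame.
frame = np.random.rand(224, 224).astype(np.float32)
aug = frequency_augment(frame, alpha=0.3)
```

In this sketch the augmented frame keeps the spatial layout of instruments and anatomy while its low-level appearance statistics shift, which is the property a domain-generalization objective would exploit.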