Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems, as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. As a result, they struggle to simultaneously capture the short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address this limitation, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame Encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. In addition, we employ a long-term Transformer encoder over the frame embeddings to strengthen long-term reasoning. MuST achieves higher performance than previous state-of-the-art methods on three different public benchmarks.
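The multi-scale sampling idea behind the Multi-Term Frame Encoder can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `multi_scale_indices`, the window length, the stride values, the symmetric centering on the frame of interest, and the clamping of out-of-range indices at video boundaries are all assumptions made for illustration. The sketch only shows how sequences of equal length sampled at increasing strides cover short-, mid-, and long-term context around one frame.

```python
from typing import Dict, List, Sequence


def multi_scale_indices(
    center: int,
    num_frames: int,
    window: int = 5,                      # hypothetical: frames per sequence (odd)
    strides: Sequence[int] = (1, 4, 16),  # hypothetical: increasing strides
) -> Dict[int, List[int]]:
    """Return, for each stride, `window` frame indices centered on `center`."""
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    if window % 2 == 0:
        raise ValueError("window must be odd so the sequence is centered")

    half = window // 2
    sequences: Dict[int, List[int]] = {}
    for stride in strides:
        # Offsets -half..+half, scaled by the stride: a larger stride
        # spans a longer temporal range with the same number of frames.
        raw = [center + (k - half) * stride for k in range(window)]
        # Assumption: indices outside the video are clamped to the first
        # or last frame, which repeats edge frames near the boundaries.
        sequences[stride] = [min(max(i, 0), num_frames - 1) for i in raw]
    return sequences


if __name__ == "__main__":
    for stride, idx in multi_scale_indices(center=100, num_frames=1000).items():
        print(f"stride {stride:>2}: {idx}")
```

In a full pipeline, each index list would select the frames fed to the encoder at that temporal scale. Whether sampling is centered or uses only past frames (as an online setting would require) is a design choice that this sketch does not take from the source.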