Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle this task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. We use ResNet50 or EfficientNetV2-M for spatial feature extraction. MS-AST and MS-ASCT model temporal information at different scales with multi-scale temporal self-attention and multi-scale temporal cross-attention, which improves the capture of temporal relationships between frames and segments. On the Cholec80 dataset, our method achieves 95.26% accuracy for online and 96.15% accuracy for offline surgical phase recognition, setting new state-of-the-art results. It also achieves state-of-the-art results on non-medical datasets in the video action segmentation domain.
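The difference between the offline and online settings can be illustrated with a minimal sketch of windowed temporal self-attention over per-frame features. This is not the MS-AST or MS-ASCT architecture, whose internals the abstract does not specify. The function names, the window sizes `(4, 16, 64)`, the identity query/key/value projections, and the averaging across scales are all assumptions made for illustration. The only point it demonstrates is that a causal mask keeps the output at frame t independent of later frames, which is what online recognition requires.

```python
import math


def temporal_mask(num_frames, window, causal):
    """Return mask[t][s] = True where query frame t may attend to key frame s."""
    mask = []
    for t in range(num_frames):
        if causal:
            # Online (assumed): only the current frame and the window-1 frames before it.
            row = [0 <= t - s < window for s in range(num_frames)]
        else:
            # Offline (assumed): a window centred on t, covering past and future frames.
            row = [abs(t - s) <= window // 2 for s in range(num_frames)]
        mask.append(row)
    return mask


def masked_self_attention(x, mask):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(x[0])
    out = []
    for t, query in enumerate(x):
        allowed = [s for s in range(len(x)) if mask[t][s]]
        scores = [
            sum(q * k for q, k in zip(query, x[s])) / math.sqrt(dim)
            for s in allowed
        ]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(v - top) for v in scores]
        total = sum(exps)
        out.append([
            sum(e / total * x[s][j] for e, s in zip(exps, allowed))
            for j in range(dim)
        ])
    return out


def multi_scale_attention(x, windows=(4, 16, 64), causal=False):
    """Average the attention outputs computed at several window sizes (hypothetical fusion)."""
    outs = [
        masked_self_attention(x, temporal_mask(len(x), w, causal))
        for w in windows
    ]
    return [
        [sum(o[t][j] for o in outs) / len(outs) for j in range(len(x[0]))]
        for t in range(len(x))
    ]
```

Here `x` is a list of per-frame feature vectors, such as those a ResNet50 or EfficientNetV2-M backbone would produce. Calling `multi_scale_attention(x, causal=True)` gives outputs that can be computed frame by frame as the video arrives, whereas `causal=False` needs the full video.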