LoViT: Long Video Transformer for Surgical Phase Recognition

Online surgical phase recognition plays a significant role in building contextual tools that can quantify performance and oversee the execution of surgical workflows. Current approaches have two limitations. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, computational constraints cause them to fuse local and global features poorly, which hampers the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information. LoViT combines a temporally-rich spatial feature extractor with a multi-scale temporal aggregator that consists of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. A multi-scale temporal head then combines local and global features and classifies surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.39 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.14 pp improvement on AutoLaparo. It also achieves a 5.25 pp improvement in phase-level Jaccard on AutoLaparo and a 1.55 pp improvement on Cholec80. These results demonstrate that our approach achieves state-of-the-art surgical phase recognition on two datasets that cover different surgical procedures and temporal sequencing characteristics, whilst introducing mechanisms that cope with long videos.
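For readers unfamiliar with ProbSparse self-attention, the block below recalls the query sparsity measure as it is commonly stated for the Informer model, from which the G-Informer module takes its name. This is background given as an assumption: the abstract does not say how LoViT configures it, and the symbols $q_i$, $k_j$, $d$, $L_Q$, $L_K$, $c$ and $u$ are introduced here for illustration only.

```latex
% Approximate sparsity score of query q_i against all L_K keys
% (max-mean form), with d the key/query dimension:
\bar{M}(q_i, K) \;=\; \max_{j}\left\{ \frac{q_i k_j^{\top}}{\sqrt{d}} \right\}
                 \;-\; \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}

% Only the u queries with the largest scores attend to the keys,
% with u growing logarithmically in the number of queries L_Q:
u \;=\; c \cdot \ln L_Q
```

Under this formulation the attention cost grows roughly as $O(L \ln L)$ in the sequence length $L$ rather than $O(L^2)$, which is the property that makes it attractive for long surgical videos.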

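The two reported metrics can be illustrated with a minimal sketch. It assumes the common definitions: video-level accuracy is the per-video frame accuracy averaged over videos, and phase-level Jaccard is the per-phase intersection-over-union averaged first over videos and then over phases. The abstract does not spell out the evaluation protocol (for example, whether relaxed phase boundaries are used), so the function names and the averaging order below are assumptions, not the authors' evaluation code.

```python
from typing import List, Optional, Sequence, Tuple

# One video = (predicted phase per frame, ground-truth phase per frame).
Video = Tuple[Sequence[int], Sequence[int]]


def frame_accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Percentage of frames in one video whose predicted phase is correct."""
    if len(pred) != len(true) or len(true) == 0:
        raise ValueError("pred and true must be non-empty and equally long")
    correct = sum(p == t for p, t in zip(pred, true))
    return 100.0 * correct / len(true)


def video_level_accuracy(videos: Sequence[Video]) -> float:
    """Mean of the per-video frame accuracies (assumed definition)."""
    return sum(frame_accuracy(p, t) for p, t in videos) / len(videos)


def phase_jaccard(pred: Sequence[int], true: Sequence[int], phase: int) -> Optional[float]:
    """Jaccard index (in percent) of one phase in one video.

    Returns None when the phase occurs in neither prediction nor ground truth.
    """
    inter = sum(p == phase and t == phase for p, t in zip(pred, true))
    union = sum(p == phase or t == phase for p, t in zip(pred, true))
    return 100.0 * inter / union if union else None


def phase_level_jaccard(videos: Sequence[Video], num_phases: int) -> float:
    """Average Jaccard over videos for each phase, then over phases (assumed order)."""
    per_phase: List[float] = []
    for phase in range(num_phases):
        scores = []
        for pred, true in videos:
            score = phase_jaccard(pred, true, phase)
            if score is not None:
                scores.append(score)
        if scores:  # skip phases that never occur in any video
            per_phase.append(sum(scores) / len(scores))
    return sum(per_phase) / len(per_phase)
```

With these definitions, a gain in percentage points is the plain difference between two scores, for example `video_level_accuracy(ours) - video_level_accuracy(baseline)`, rather than a relative change.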