LoViT: Long Video Transformer for Surgical Phase Recognition

Online surgical phase recognition plays a significant role in building contextual tools that can quantify performance and oversee the execution of surgical workflows. Current approaches have two main limitations. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, computational constraints cause them to fuse local and global features poorly, which hampers the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information. LoViT combines a temporally rich spatial feature extractor with a multi-scale temporal aggregator, which consists of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then combines local and global features and classifies surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.39 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.14 pp improvement on AutoLaparo. It also achieves a 5.25 pp improvement in phase-level Jaccard on AutoLaparo and a 1.55 pp improvement on Cholec80. These results demonstrate that our approach achieves state-of-the-art surgical phase recognition on two datasets with different surgical procedures and temporal sequencing characteristics, while introducing mechanisms that cope with long videos.
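
To make the two reported metrics concrete, the following is a minimal sketch of how video-level accuracy and phase-level Jaccard could be computed from per-frame phase labels. It is an illustration under stated assumptions, not the evaluation code used for the reported numbers: the function names are hypothetical, accuracy is assumed to be averaged over videos, and Jaccard is assumed to be averaged first over the videos containing a phase and then over phases. The abstract does not specify the exact evaluation protocol.

```python
from collections import defaultdict


def frame_accuracy(pred, truth):
    """Fraction of frames in one video whose predicted phase equals the label."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def phase_jaccard(pred, truth):
    """Per-phase Jaccard index for one video, returned as {phase: score}.

    Assumption: only phases present in the ground truth are scored.
    """
    inter = defaultdict(int)  # frames where prediction and label agree on the phase
    union = defaultdict(int)  # frames where prediction or label is the phase
    for p, t in zip(pred, truth):
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    return {ph: inter[ph] / union[ph] for ph in set(truth)}


def video_level_accuracy(videos):
    """Mean of per-video accuracies; `videos` is a list of (pred, truth) pairs."""
    return sum(frame_accuracy(p, t) for p, t in videos) / len(videos)


def phase_level_jaccard(videos, phases):
    """Average Jaccard per phase over videos, then average over phases."""
    per_phase = []
    for ph in phases:
        scores = [phase_jaccard(p, t)[ph] for p, t in videos if ph in t]
        if scores:  # skip phases that never occur in the ground truth
            per_phase.append(sum(scores) / len(scores))
    return sum(per_phase) / len(per_phase)


if __name__ == "__main__":
    # Toy data (made up for illustration): two short videos, two phases.
    toy_videos = [
        ([0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1]),  # one frame switches phase too early
        ([0, 0, 1, 1], [0, 0, 1, 1]),              # perfect prediction
    ]
    print(round(100 * video_level_accuracy(toy_videos), 2))         # 91.67
    print(round(100 * phase_level_jaccard(toy_videos, [0, 1]), 2))  # 85.42
```

Under these definitions, an improvement stated in percentage points is simply the difference between two such scores after multiplying each by 100.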
