Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgery. Existing state-of-the-art methods for surgical step recognition either model spatial and temporal information in separate stages or, when learning the two jointly, operate only over a short temporal range. As a result, neither line of work exploits the benefits of jointly modeling spatio-temporal features together with long-range information. In this paper, we propose a vision transformer-based approach that jointly learns spatio-temporal features directly from a sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer
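The abstract does not spell out how the gated-temporal attention mechanism combines its two feature streams. As a rough illustration only, the sketch below assumes one common form of gating: a sigmoid gate computed from the concatenated short-term and long-term feature vectors, used to blend them per dimension. The function name `gated_fusion`, its parameters, and the gate formula are hypothetical and are not taken from the paper or the GLSFormer repository.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(short_feat, long_feat, weights, bias):
    """Blend two equal-length feature vectors with a per-dimension gate.

    ASSUMPTION: this gate form is illustrative, not the paper's definition.
    weights: one row per output dimension, each row of length 2 * dim
    bias:    one value per output dimension
    """
    dim = len(short_feat)
    # The gate sees both streams at once.
    concat = list(short_feat) + list(long_feat)
    fused = []
    for i in range(dim):
        pre_activation = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(pre_activation)
        # gate near 1 favors the short-term stream, near 0 the long-term one.
        fused.append(gate * short_feat[i] + (1.0 - gate) * long_feat[i])
    return fused


if __name__ == "__main__":
    short_term = [1.0, 2.0]   # hypothetical short-range clip features
    long_term = [3.0, 4.0]    # hypothetical long-range clip features
    zero_weights = [[0.0] * 4, [0.0] * 4]
    print(gated_fusion(short_term, long_term, zero_weights, [0.0, 0.0]))
```

With all-zero weights and biases the gate is 0.5 everywhere, so the output is the plain average of the two streams; in a trained model the weights would be learned so that the gate shifts toward whichever temporal range is more informative for a given feature.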