This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks at varying levels of granularity. This formulation enables a multi-level comprehension of surgical activities, encompassing long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit the proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to address the multi-granularity of the benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks, highlight the varying granularity requirements of each task, and establish TAPIS's superiority over previously proposed baselines and conventional CNN-based models. Additionally, we validate the robustness of our method across multiple public benchmarks, confirming the reliability and applicability of our dataset. This work represents a significant step forward in Endoscopic Vision, offering a novel and comprehensive framework for future research towards a holistic understanding of surgical procedures.