Most benchmarks for studying surgical interventions focus on a single challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term reasoning (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations: it uses the representation learned on the instrument detection task to improve its classification capacity. Our experimental results on both PSI-AVA and other publicly available datasets demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.