Surgical videos captured from microscopic or endoscopic imaging devices are rich but complex sources of information, depicting the various tools and anatomical structures involved over an extended period of time. Although these videos contain crucial workflow information and are routinely recorded in many procedures, their use for automated surgical workflow understanding is still limited. In this work, we exploit scene graphs as a more holistic, semantically meaningful, and human-readable way to represent surgical videos, encoding all anatomical structures, tools, and their interactions. To properly evaluate the impact of our solutions, we create a scene graph dataset from the semantic segmentations of the CaDIS and CATARACTS datasets. We demonstrate that scene graphs can be leveraged with graph convolutional networks (GCNs) to tackle downstream surgical tasks such as surgical workflow recognition with competitive performance. Moreover, we demonstrate the benefits of surgical scene graphs for the explainability and robustness of model decisions, both of which are crucial in the clinical setting.
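To make the pipeline described above more concrete, the following is a minimal sketch of how a scene graph might be derived from a single semantic segmentation mask and passed through one graph convolution. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how edges are defined, so the sketch assumes an edge whenever two classes share a pixel border. The class ids (0, 1, 2), the one-hot node features, the random weights, and the function names `mask_to_scene_graph` and `gcn_layer` are all hypothetical.

```python
import numpy as np


def mask_to_scene_graph(mask):
    """Build a graph from a 2D array of class ids.

    Nodes are the class ids present in the mask. Two nodes are connected
    if their regions touch (4-neighbourhood). This edge rule is an
    assumption made for illustration only.
    """
    classes = np.unique(mask)
    index = {int(c): i for i, c in enumerate(classes)}
    adj = np.zeros((len(classes), len(classes)), dtype=np.float32)
    # Compare every pixel with its right neighbour, then its lower neighbour.
    pairs = ((mask[:, :-1], mask[:, 1:]), (mask[:-1, :], mask[1:, :]))
    for a, b in pairs:
        border = a != b
        for u, v in zip(a[border], b[border]):
            i, j = index[int(u)], index[int(v)]
            adj[i, j] = adj[j, i] = 1.0
    return classes, adj


def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)


# Toy mask with hypothetical class ids: 0 = background, 1 = anatomy, 2 = tool.
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

classes, adj = mask_to_scene_graph(mask)

rng = np.random.default_rng(0)
features = np.eye(len(classes), dtype=np.float32)   # one-hot node features
weight = rng.normal(size=(len(classes), 4)).astype(np.float32)

node_embeddings = gcn_layer(adj, features, weight)  # shape (3, 4)
graph_embedding = node_embeddings.mean(axis=0)      # pooled per-frame vector
```

In this toy mask the tool region touches only the anatomy region, so the resulting graph has the edges background-anatomy and anatomy-tool but no background-tool edge. A workflow recognition model would presumably stack several such layers and feed the pooled per-frame embedding to a classifier; that part is left out here because the abstract gives no details on it.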