Surgical scene understanding is a key technical component for enabling intelligent, context-aware systems that can transform many aspects of surgical interventions. In this work, we focus on the semantic segmentation task, propose a simple yet effective multi-modal (RGB and depth) training framework called SurgDepth, and show state-of-the-art (SOTA) results on all publicly available datasets applicable to this task. Unlike previous approaches, which either fine-tune SOTA segmentation models trained on natural images or encode RGB or RGB-D information using RGB-only pre-trained backbones, SurgDepth, built on top of Vision Transformers (ViTs), is designed to encode both RGB and depth information through a simple fusion mechanism. We conduct extensive experiments on benchmark datasets, including EndoVis2022, AutoLaparo, LapI2I, and EndoVis2017, to verify the efficacy of SurgDepth. Specifically, SurgDepth achieves a new SOTA IoU of 0.86 on the EndoVis2022 SAR-RARP50 challenge, outperforming the current best method by at least 4% while using a shallow, compute-efficient decoder consisting of ConvNeXt blocks.
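Since the abstract only describes the architecture at a high level (a ViT encoder with a simple RGB-depth fusion, followed by a shallow ConvNeXt-block decoder), the following is a minimal PyTorch sketch of one plausible instantiation. All class names (`SurgDepthSketch`, `PatchEmbed`), dimensions, and the additive token fusion are illustrative assumptions rather than the paper's exact design; positional embeddings are omitted for brevity.

```python
# Illustrative sketch only: per-modality patch embeddings, additive token
# fusion (an assumption for the "simple fusion mechanism"), a ViT encoder,
# and a shallow ConvNeXt-block decoder for dense segmentation.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Patchify an input map into ViT tokens."""
    def __init__(self, in_ch, dim=384, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                      # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim)


class ConvNeXtBlock(nn.Module):
    """Standard ConvNeXt block: depthwise 7x7 conv + pointwise MLP."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        res = x
        x = self.dw(x).permute(0, 2, 3, 1)    # channels-last for LN/MLP
        x = self.pw2(self.act(self.pw1(self.norm(x))))
        return res + x.permute(0, 3, 1, 2)


class SurgDepthSketch(nn.Module):
    def __init__(self, num_classes, dim=384, num_layers=6, patch=16):
        super().__init__()
        self.rgb_embed = PatchEmbed(3, dim, patch)
        self.depth_embed = PatchEmbed(1, dim, patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=6, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Shallow decoder: two ConvNeXt blocks plus a 1x1 classification head.
        self.decoder = nn.Sequential(ConvNeXtBlock(dim), ConvNeXtBlock(dim))
        self.head = nn.Conv2d(dim, num_classes, 1)
        self.patch = patch

    def forward(self, rgb, depth):
        # "Simple fusion": sum RGB and depth token embeddings before the ViT.
        tokens = self.rgb_embed(rgb) + self.depth_embed(depth)
        tokens = self.encoder(tokens)
        B, N, C = tokens.shape
        h, w = rgb.shape[-2] // self.patch, rgb.shape[-1] // self.patch
        feat = tokens.transpose(1, 2).reshape(B, C, h, w)
        logits = self.head(self.decoder(feat))
        return nn.functional.interpolate(logits, size=rgb.shape[-2:],
                                         mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = SurgDepthSketch(num_classes=9)
    rgb = torch.randn(1, 3, 224, 224)
    depth = torch.randn(1, 1, 224, 224)   # e.g. from a monocular depth estimator
    print(model(rgb, depth).shape)        # torch.Size([1, 9, 224, 224])
```

Summing per-modality token embeddings keeps the encoder's sequence length and compute identical to an RGB-only ViT, which is consistent with the abstract's emphasis on a simple fusion mechanism and a compute-efficient overall design.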