The growing popularity of robotic minimally invasive surgery has made deep learning-based surgical training a key area of research. A thorough understanding of surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state-of-the-art (SOTA) models struggle to balance capturing high-level contextual features with low-level edge features. We propose a Feature-Adaptive Spatial Localization model (FASL-Seg), designed to capture features at multiple levels of detail through two distinct processing streams, a Low-Level Feature Projection (LLFP) stream and a High-Level Feature Projection (HLFP) stream, for varying feature resolutions, enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL-Seg on the surgical segmentation benchmark datasets EndoVis18 and EndoVis17 across three use cases. FASL-Seg achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves mIoUs of 85.61% and 72.78% on EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming the SOTA overall performance, with per-class results comparable to SOTA in both datasets and consistent performance across anatomy and instrument classes, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
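The mIoU figures above can be made concrete with a minimal sketch of how mean Intersection over Union is typically computed for semantic segmentation masks. This is an illustrative implementation, not the paper's evaluation code; the class count and label encoding (integer class IDs per pixel) are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in prediction or ground truth.

    pred, target: integer arrays of per-pixel class IDs (same shape).
    Classes absent from both masks are skipped rather than counted as IoU 0.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent in both masks; skip it
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 example: class 0 matches fully, classes 1 and 2 each overlap
# on one of two pixels, giving IoUs of 1.0, 0.5, 0.5 -> mIoU of 2/3.
pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, num_classes=3))  # → 0.666...
```

Reported benchmark scores such as the 72.71% on EndoVis18 parts and anatomy are averages of this per-class quantity, so consistent per-class performance, as claimed for FASL-Seg, matters as much as the headline mean.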