The Segment Anything Model (SAM) has significantly advanced interactive segmentation, but it struggles with the high-resolution images that high-precision segmentation requires. This is primarily due to the quadratic space complexity of the attention implementation used in SAM and the length extrapolation issue in common global attention. This study proposes HRSAM, which integrates Flash Attention and incorporates Plain, Shifted and newly proposed Cycle-scan Window (PSCWin) attention to address these issues. The shifted window attention is redesigned with padding to maintain consistent window sizes, enabling effective length extrapolation. The cycle-scan window attention adopts the recently developed State Space Models (SSMs) to ensure global information exchange with minimal computational overhead. Such window-based attention allows HRSAM to perform effective attention computations on scaled input images while maintaining low latency. We further propose HRSAM++, which additionally employs a multi-scale strategy to enhance HRSAM's performance. Experiments on the high-precision segmentation datasets HQSeg44K and DAVIS show that high-resolution inputs enable the SAM-distilled HRSAM models to outperform the teacher model while maintaining lower latency. Compared to state-of-the-art methods, HRSAM achieves an improvement of 1.56 in the NoC95 metric for interactive segmentation with only 31% of the latency. HRSAM++ further enhances the performance, achieving an improvement of 1.63 in NoC95 with just 38% of the latency.
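To illustrate why padded, fixed-size windows help at high resolution, the following is a minimal sketch and not the HRSAM implementation. It assumes a 2D grid of tokens stored as nested lists; the helper names (`pad_to_window_multiple`, `partition_windows`, `attention_entries`) and the zero fill value are hypothetical choices made for illustration. It pads the grid so that every window has the same size, and counts attention-matrix entries for global versus windowed attention.

```python
def pad_to_window_multiple(grid, window, fill=0):
    """Pad a 2D token grid (list of lists) on the bottom and right so that
    both dimensions become multiples of `window`."""
    height, width = len(grid), len(grid[0])
    pad_h = (-height) % window
    pad_w = (-width) % window
    padded = [list(row) + [fill] * pad_w for row in grid]
    padded += [[fill] * (width + pad_w) for _ in range(pad_h)]
    return padded


def partition_windows(grid, window, fill=0):
    """Split the padded grid into non-overlapping windows.
    Each window is returned as a flat list of window * window tokens,
    so every window has the same sequence length."""
    padded = pad_to_window_multiple(grid, window, fill)
    height, width = len(padded), len(padded[0])
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                padded[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


def attention_entries(height, width, window=None):
    """Count attention-matrix entries for one head.
    Global attention: (H * W) ** 2.
    Windowed attention: number of windows * (window * window) ** 2."""
    if window is None:
        tokens = height * width
        return tokens * tokens
    n_h = -(-height // window)  # ceiling division, accounts for padding
    n_w = -(-width // window)
    return n_h * n_w * (window * window) ** 2


if __name__ == "__main__":
    grid = [[1] * 50 for _ in range(70)]      # 70 x 50 token grid
    wins = partition_windows(grid, 16)        # padded to 80 x 64
    print(len(wins), len(wins[0]))
    print(attention_entries(64, 64), attention_entries(64, 64, 16))
```

In this sketch the per-window sequence length stays fixed at `window * window` regardless of the input size, so the windowed entry count grows linearly with the number of tokens, whereas the global count grows quadratically. A real model would also need to mask the padded positions inside attention, which is omitted here.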