Stable Segment Anything Model

The Segment Anything Model (SAM) achieves remarkable promptable segmentation when given high-quality prompts, but such prompts often require considerable skill to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis of SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding is that, given such low-quality prompts, SAM's mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea is to calibrate only SAM's mask attention by adjusting the sampling locations and amplitudes of image features, while the original SAM architecture and weights remain unchanged. Our deformable sampling plugin (DSP) thus enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner, facilitated by our robust training strategy (RTS). During inference, a dynamic routing plugin (DRP) toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Our solution, termed Stable-SAM, offers several advantages: 1) it improves SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation (one training epoch). Extensive experiments across multiple datasets validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything. Code will be released upon acceptance at https://github.com/fanq15/Stable-SAM
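
To make the idea of adjusting sampling locations and amplitudes concrete, the following is a minimal pure-Python sketch on a single-channel 2D feature map. It is not the released Stable-SAM code: the function names, the hand-supplied offsets and amplitudes (learned in the actual method), the scalar prompt_quality score, and the hard threshold gate are all assumptions made for illustration only.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x), clamped to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_sample(feat, offsets, amplitudes):
    """Resample each grid cell at (row + dy, col + dx) and scale it by its amplitude.

    offsets[i][j] is a (dy, dx) pair and amplitudes[i][j] a scalar; both are
    supplied by hand here (assumption), standing in for learned predictions.
    """
    h, w = len(feat), len(feat[0])
    return [
        [
            amplitudes[i][j]
            * bilinear_sample(feat, i + offsets[i][j][0], j + offsets[i][j][1])
            for j in range(w)
        ]
        for i in range(h)
    ]


def route(feat, offsets, amplitudes, prompt_quality, threshold=0.5):
    """Hypothetical hard gate: regular grid for good prompts, deformable sampling otherwise."""
    if prompt_quality >= threshold:
        return [row[:] for row in feat]  # regular grid sampling leaves features unchanged
    return deformable_sample(feat, offsets, amplitudes)


feat = [[0.0, 1.0], [2.0, 3.0]]
offsets = [[(0.0, 0.5), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
amplitudes = [[2.0, 1.0], [1.0, 1.0]]
low_quality = route(feat, offsets, amplitudes, prompt_quality=0.1)
high_quality = route(feat, offsets, amplitudes, prompt_quality=0.9)
```

With zero offsets and unit amplitudes the deformable path reduces to the regular grid, which mirrors the claim that the original model behaviour is preserved when no calibration is needed.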
