BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, degrades in performance on datasets whose image sizes vary. Previous approaches either resize images to a fixed size or modify the model structure, which hinders the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires completely retraining the model, which is costly and impractical for deployment in downstream tasks. We reformulate this issue as a length extrapolation problem: with the patch size held constant, images of different sizes yield token sequences of different lengths. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM), which enhances SAM's adaptability to varying image resolutions without requiring structural modifications. First, we introduce a new scaling factor that keeps the magnitude of the attention layer's dot-product values consistent as the token sequence length changes. Second, we present a bias-mode attention mask that lets each token prioritize neighboring information, mitigating the impact of distant information not seen during training. BA-SAM proves effective in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, shows that it significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
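The two ideas in the abstract (a length-aware scaling factor and a distance-based attention bias) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the scale `log(n) / log(n_train) / sqrt(d)`, the linear Manhattan-distance penalty, the `slope` value, and all function names are hypothetical stand-ins rather than BA-SAM's definitions. The sketch only shows how a fixed patch size turns a change in image size into a change in token count, and how a scale and an additive bias would enter the attention logits.

```python
import math


def patch_grid(height, width, patch=16):
    """Rows and columns of the patch grid for an image at a fixed patch size."""
    return height // patch, width // patch


def length_aware_scale(n_tokens, n_train, head_dim):
    """Hypothetical scale: reduces to the usual 1/sqrt(d) when n_tokens == n_train.

    Requires n_train > 1 so that log(n_train) is non-zero.
    """
    return math.log(n_tokens) / math.log(n_train) / math.sqrt(head_dim)


def distance_bias(rows, cols, slope=0.1):
    """Hypothetical bias: -slope * Manhattan distance between patch positions."""
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    return [
        [-slope * (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(q, k, rows, cols, n_train, slope=0.1):
    """Attention weights for n = rows * cols tokens.

    q and k are lists of n vectors of equal length d. Each logit is
    scale * <q_i, k_j> + bias[i][j], followed by a row-wise softmax.
    """
    n, d = len(q), len(q[0])
    scale = length_aware_scale(n, n_train, d)
    bias = distance_bias(rows, cols, slope)
    weights = []
    for i in range(n):
        logits = [
            scale * sum(a * b for a, b in zip(q[i], k[j])) + bias[i][j]
            for j in range(n)
        ]
        weights.append(softmax(logits))
    return weights


# A 512x512 image at patch size 16 gives a 32x32 grid, i.e. 1024 tokens,
# versus 4096 tokens for a 1024x1024 image at the same patch size.
rows, cols = patch_grid(512, 512)
```

With all-zero queries and keys the logits reduce to the bias alone, so each token's weights fall off with distance from its own position; this is the "prioritize neighboring information" behavior in its simplest form. Whether the distance should be measured on the 2D patch grid or along the flattened sequence is a design choice the abstract leaves open.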
