BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

In this paper, we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, suffers performance degradation on datasets whose image sizes vary. Previous approaches either resize the image to a fixed size or modify the model structure, both of which hinder the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires retraining the model completely, which is costly and impractical for deployment in downstream tasks. We reformulate this issue as a length extrapolation problem, in which the token sequence length varies while the patch size stays consistent across images of different sizes. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions without requiring structure modifications. First, we introduce a new scaling factor that keeps the magnitude of the attention layer's dot product values consistent when the token sequence length changes. Second, we present a bias-mode attention mask that lets each token prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, shows that it significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously.
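The two components described above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the scale uses a log-length ratio (so that it reduces to the usual 1/sqrt(d) when the sequence length equals the hypothetical training length `train_len`), and the bias is a linear penalty on the Manhattan distance between patches with a hypothetical `slope` parameter. The function names are likewise invented for this example and are not the authors' implementation.

```python
import math


def grid_positions(height, width):
    """Return (row, col) coordinates for a height x width grid of patch tokens."""
    return [(r, c) for r in range(height) for c in range(width)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(queries, keys, height, width, train_len, slope=0.5):
    """Attention weights with a length-aware scale and a distance bias.

    ASSUMPTIONS (illustrative only, not taken from the paper):
      * scale = (log n / log train_len) / sqrt(d), with train_len > 1
      * bias  = -slope * Manhattan distance between the two patches
    """
    n = height * width
    d = len(queries[0])
    # Equals the standard 1/sqrt(d) when n == train_len, and grows slowly with n.
    scale = (math.log(n) / math.log(train_len)) / math.sqrt(d)
    pos = grid_positions(height, width)
    weights = []
    for i in range(n):
        logits = []
        for j in range(n):
            dot = sum(q * k for q, k in zip(queries[i], keys[j]))
            dist = abs(pos[i][0] - pos[j][0]) + abs(pos[i][1] - pos[j][1])
            # Distant patches receive a larger penalty before the softmax.
            logits.append(scale * dot - slope * dist)
        weights.append(softmax(logits))
    return weights


if __name__ == "__main__":
    h, w, d = 4, 4, 8
    # Identical tokens make every dot product equal, isolating the bias effect.
    tokens = [[1.0] * d for _ in range(h * w)]
    attn = biased_attention_weights(tokens, tokens, h, w, train_len=h * w)
    print([round(a, 4) for a in attn[0][:w]])
```

Because the bias is added to the logits rather than used as a hard mask, distant tokens are down-weighted but never excluded; with identical tokens, each row's weights decrease monotonically with patch distance.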
