BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

We address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, suffers performance degradation on datasets with varying image sizes. Previous approaches either resize images to a fixed size or modify the model structure, which hinders the preservation of SAM's rich prior knowledge. Moreover, such task-specific tuning requires completely retraining the model, which is costly and impractical for deployment in downstream tasks. We reformulate the issue as a length extrapolation problem: with a consistent patch size, images of different sizes yield token sequences of different lengths. To this end, we propose the Scalable Bias-Mode Attention Mask (BA-SAM), which enhances SAM's adaptability to varying image resolutions without structure modifications. First, we introduce a new scaling factor that keeps the magnitude of the attention layer's dot-product values consistent when the token sequence length changes. Second, we present a bias-mode attention mask that lets each token prioritize neighboring information, mitigating the impact of untrained distant information. BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, including DIS5K, DUTS, ISIC, COD10K, and COCO, shows that it significantly mitigates performance degradation in the zero-shot setting and achieves state-of-the-art performance with minimal fine-tuning. Furthermore, we propose a generalized model and benchmark, showcasing BA-SAM's generalizability across all four datasets simultaneously. Code is available at https://github.com/zongzi13545329/BA-SAM
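
The abstract does not give formulas for the two components, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes (a) a scaling factor of the form log(n) / log(n_train) applied on top of the usual 1/sqrt(d), where n is the current token count and n_train a hypothetical training-time token count, and (b) an additive bias equal to a negative slope times the Manhattan distance between patch positions. The function names and the `slope` parameter are hypothetical.

```python
import math

def grid_positions(height, width):
    """Row-major (row, col) coordinates of the patches in a height x width grid."""
    return [(r, c) for r in range(height) for c in range(width)]

def distance_bias(positions, slope):
    """Additive attention bias (assumed form): -slope * Manhattan distance.
    Nearby tokens get a bias near 0; distant tokens get a large negative bias."""
    return [[-slope * (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in positions]
            for (r1, c1) in positions]

def length_aware_scale(head_dim, num_tokens, train_tokens):
    """Assumed scaling factor: log(n) / log(n_train) times the usual 1/sqrt(d).
    Reduces to 1/sqrt(d) when num_tokens == train_tokens (train_tokens must be > 1)."""
    return math.log(num_tokens) / (math.log(train_tokens) * math.sqrt(head_dim))

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, grid_hw, train_tokens, slope=0.5):
    """Attention weights for one head over a grid of patch tokens.
    queries and keys are lists of equal-length vectors, one per patch."""
    height, width = grid_hw
    positions = grid_positions(height, width)
    n = len(positions)
    assert len(queries) == n and len(keys) == n
    scale = length_aware_scale(len(queries[0]), n, train_tokens)
    bias = distance_bias(positions, slope)
    weights = []
    for i in range(n):
        logits = [scale * sum(q * k for q, k in zip(queries[i], keys[j])) + bias[i][j]
                  for j in range(n)]
        weights.append(softmax(logits))
    return weights

# Usage: identical content vectors, so only the distance bias shapes the weights.
tokens = [[1.0, 1.0] for _ in range(9)]          # 3 x 3 patch grid
weights = attention_weights(tokens, tokens, (3, 3), train_tokens=4)
```

With identical content vectors, each row of `weights` depends only on the bias, so a token attends most to itself and progressively less to patches farther away on the grid. This is the "prioritize neighboring information" behavior the abstract describes, in a deliberately simplified form.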
