The recent Segment Anything Model (SAM) represents a major leap in scaling up segmentation models, enabling powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly for objects with intricate structures. We propose HQ-SAM, which equips SAM with the ability to accurately segment any object while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our design reuses and preserves the pre-trained model weights of SAM, introducing only minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of applying it only to mask-decoder features, we first fuse those features with early and final ViT features for improved mask details. To train the introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is trained only on this dataset of 44K masks, which takes only 4 hours on 8 GPUs. We show the efficacy of HQ-SAM on a suite of 9 diverse segmentation datasets across different downstream tasks, 7 of which are evaluated in a zero-shot transfer protocol. Our code and models will be released at https://github.com/SysCV/SAM-HQ.
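To make the data flow described above concrete, the following is a minimal sketch of how a learned output token could produce a mask from fused features. It is not the released implementation. The function names (`fuse_features`, `predict_mask_logits`, `binarize`), the element-wise sum used for fusion, the dot product between the token embedding and each pixel feature, and the zero threshold are all assumptions made for illustration; the abstract does not specify these operators. Feature maps are plain nested lists of shape (H, W, C) so the example has no dependencies.

```python
def fuse_features(decoder_feat, early_vit_feat, final_vit_feat):
    """Fuse three (H, W, C) feature maps by element-wise sum.

    Assumption: all three maps already share the same shape. The sum is an
    illustrative choice, not the fusion operator confirmed by the abstract.
    """
    return [
        [
            [d + e + f for d, e, f in zip(d_px, e_px, f_px)]
            for d_px, e_px, f_px in zip(d_row, e_row, f_row)
        ]
        for d_row, e_row, f_row in zip(decoder_feat, early_vit_feat, final_vit_feat)
    ]


def predict_mask_logits(token_embedding, feature_map):
    """Score each pixel by the dot product of its C-dim feature with the token.

    `token_embedding` stands in for the learned High-Quality Output Token
    after it has passed through the mask decoder (hypothetical C-dim vector).
    """
    return [
        [sum(t * x for t, x in zip(token_embedding, px)) for px in row]
        for row in feature_map
    ]


def binarize(logits, threshold=0.0):
    """Turn per-pixel logits into a 0/1 mask (threshold is an assumption)."""
    return [[1 if value > threshold else 0 for value in row] for row in logits]


if __name__ == "__main__":
    # Toy 1x2 image with 2 channels per pixel.
    decoder = [[[1.0, 2.0], [0.0, 1.0]]]
    early = [[[0.5, 0.5], [-1.0, 0.0]]]
    final = [[[0.5, -0.5], [0.0, 0.0]]]
    token = [1.0, 0.5]

    fused = fuse_features(decoder, early, final)
    print(binarize(predict_mask_logits(token, fused)))
```

In this sketch only `token` would be learned, and the three input feature maps would come from the frozen SAM encoder and decoder, which mirrors the claim that the pre-trained weights are preserved and few parameters are added.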