The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while introducing only minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of applying it only to mask-decoder features, we first fuse these features with early and final ViT features for improved mask details. To train the newly introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is trained only on this dataset of 44K masks, which takes just 4 hours on 8 GPUs. We show the efficacy of HQ-SAM on a suite of 10 diverse segmentation datasets across different downstream tasks, 8 of which are evaluated in a zero-shot transfer protocol. Our code and pretrained models are at https://github.com/SysCV/SAM-HQ.
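To make the fusion step concrete, the following is a minimal pure-Python sketch of the idea described in the abstract: several feature maps are fused, and a token vector then scores every pixel of the fused map to produce mask logits. The abstract does not specify the fusion operator or how the token is applied, so the element-wise sum, the per-pixel dot product, and all names here (`fuse_features`, `predict_mask_logits`, `binarize`) are illustrative assumptions and not the released HQ-SAM implementation.

```python
from typing import List

# A feature map is indexed as [row][column][channel].
FeatureMap = List[List[List[float]]]


def fuse_features(*maps: FeatureMap) -> FeatureMap:
    """Fuse same-shaped feature maps by element-wise sum (assumed operator)."""
    first = maps[0]
    height, width, channels = len(first), len(first[0]), len(first[0][0])
    return [
        [
            [sum(m[y][x][k] for m in maps) for k in range(channels)]
            for x in range(width)
        ]
        for y in range(height)
    ]


def predict_mask_logits(token: List[float], features: FeatureMap) -> List[List[float]]:
    """Score each pixel by the dot product of the token with its feature vector."""
    return [
        [sum(t * f for t, f in zip(token, pixel)) for pixel in row]
        for row in features
    ]


def binarize(logits: List[List[float]], threshold: float = 0.0) -> List[List[int]]:
    """Turn logits into a binary mask (threshold value is an assumption)."""
    return [[1 if value > threshold else 0 for value in row] for row in logits]


# Toy 1x2 image with 2 channels; all values are made up for illustration.
early_vit = [[[1.0, 0.0], [0.0, 1.0]]]
final_vit = [[[0.5, 0.0], [0.0, 0.5]]]
decoder = [[[0.5, 0.0], [0.0, 0.5]]]

hq_token = [1.0, -1.0]  # stands in for the learned High-Quality Output Token

fused = fuse_features(early_vit, final_vit, decoder)
logits = predict_mask_logits(hq_token, fused)
mask = binarize(logits)
print(mask)  # [[1, 0]]
```

In this toy setting only `hq_token` would be learned, while the three feature maps come from frozen components, which mirrors the abstract's claim that the pre-trained SAM weights are reused and only a small number of parameters are added.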