MobileSAMv2: Faster Segment Anything to Everything

The Segment Anything Model (SAM) addresses two practical yet challenging segmentation tasks: \textbf{segment anything (SegAny)}, which uses a point prompt to predict the mask for a single object of interest, and \textbf{segment everything (SegEvery)}, which predicts the masks for all objects in the image. What makes SegAny slow for SAM is its heavyweight image encoder, a problem that MobileSAM has addressed via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder, which must first generate numerous masks from redundant grid-search prompts and then filter them to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks from only valid prompts, which can be obtained through object discovery. Our approach not only reduces the total time spent in the mask decoder by a factor of at least 16 but also achieves superior performance. Specifically, it yields an average performance boost of 3.6\% (42.5\% \textit{vs.} 38.9\%) for zero-shot object proposal on the LVIS dataset with the mask AR@$K$ metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmentation. We term this project, which targets faster SegEvery than the original SAM, MobileSAMv2 to differentiate it from MobileSAM, which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: \href{https://github.com/ChaoningZhang/MobileSAM}{\textcolor{red}{https://github.com/ChaoningZhang/MobileSAM}}. \end{abstract}
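To make the contrast between grid-search prompts and object-aware prompts concrete, the following minimal Python sketch counts how many masks the decoder would have to produce under each strategy. It is an illustration under stated assumptions, not the authors' implementation: the grid size (32 points per side), the three candidate masks per point prompt, the single mask per box prompt, and the example boxes are all hypothetical values chosen for the example, and the helper names are invented here.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates.
Box = Tuple[float, float, float, float]


def grid_point_prompts(points_per_side: int) -> List[Point]:
    """Return a uniform grid of point prompts in normalized coordinates.

    Points are placed at cell centers, so none lies on the image border.
    """
    offset = 1.0 / (2 * points_per_side)
    coords = [offset + i / points_per_side for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


def object_aware_prompts(discovered_boxes: List[Box]) -> List[Box]:
    """Use each discovered object box directly as one prompt.

    Assumption: object discovery is performed by some external detector
    (not shown) that returns one box per object.
    """
    return list(discovered_boxes)


def decoder_mask_count(num_prompts: int, masks_per_prompt: int) -> int:
    """Number of masks the decoder must generate before any filtering."""
    return num_prompts * masks_per_prompt


# Hypothetical settings, for illustration only.
grid = grid_point_prompts(points_per_side=32)
grid_masks = decoder_mask_count(len(grid), masks_per_prompt=3)

# Hypothetical detector output for an image containing three objects.
boxes = [(0.10, 0.10, 0.40, 0.50), (0.55, 0.20, 0.90, 0.60), (0.30, 0.65, 0.70, 0.95)]
box_masks = decoder_mask_count(len(object_aware_prompts(boxes)), masks_per_prompt=1)

print(f"grid-search prompts: {len(grid)} prompts, {grid_masks} masks to filter")
print(f"object-aware prompts: {len(boxes)} prompts, {box_masks} masks")
```

In this sketch the grid strategy scales with the grid resolution regardless of image content, whereas the object-aware strategy scales with the number of discovered objects, which is why restricting the decoder to valid prompts removes most of the redundant work and the subsequent filtering step.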
