Segment Anything Model (SAM) has attracted significant attention for its impressive zero-shot transfer performance and its versatility across numerous vision applications, such as image editing with fine-grained control. Many such applications need to run on resource-constrained edge devices, like mobile phones. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM in the way described in the original SAM paper leads to unsatisfactory performance, especially when limited training resources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, which motivates our proposed decoupled distillation. Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) into a lightweight image encoder, which is then automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than one day, and the resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. In terms of inference speed, on a single GPU MobileSAM runs at around 10ms per image: 8ms on the image encoder and 4ms on the mask decoder. With superior performance, MobileSAM is around 5 times faster than the concurrent FastSAM and 7 times smaller, making it more suitable for mobile applications. The code for our project is provided at \href{https://github.com/ChaoningZhang/MobileSAM}{\textcolor{red}{MobileSAM}}, with a demo showing that MobileSAM can run relatively smoothly on CPU.
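The decoupled distillation idea can be illustrated with a toy sketch. This is not the authors' implementation: the two small linear maps below are hypothetical stand-ins for the ViT-H teacher encoder and the lightweight student encoder, and the mean-squared-error loss on embeddings is an assumption, since the abstract does not name the loss. The point it shows is only that the student is fitted to the frozen teacher's embeddings, while the mask decoder never appears in the training loop.

```python
import random

def matvec(w, x):
    """Apply a linear 'encoder' (list of weight rows) to an input vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def mse(a, b):
    """Mean squared error between two embeddings of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def distill_step(student_w, teacher_w, x, lr):
    """One SGD step pulling the student embedding toward the teacher's.

    The teacher weights are read but never updated (frozen), and no mask
    decoder is involved: only the student encoder is optimized.
    """
    target = matvec(teacher_w, x)  # frozen teacher embedding
    pred = matvec(student_w, x)    # student embedding before the update
    n = len(pred)
    for i in range(n):
        grad_out = 2.0 * (pred[i] - target[i]) / n  # d(MSE)/d(pred[i])
        for j in range(len(x)):
            student_w[i][j] -= lr * grad_out * x[j]
    return mse(pred, target)

def run_distillation(steps=500, lr=0.1, seed=0):
    """Fit a toy student encoder to a toy teacher on random inputs."""
    rng = random.Random(seed)
    # Hypothetical teacher weights standing in for the heavy encoder.
    teacher_w = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
    student_w = [[0.0] * 3 for _ in range(2)]
    losses = []
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        losses.append(distill_step(student_w, teacher_w, x, lr))
    return teacher_w, student_w, losses

if __name__ == "__main__":
    teacher_w, student_w, losses = run_distillation()
    print("first loss:", losses[0])
    print("last loss:", losses[-1])
```

Because the student is trained to reproduce the teacher's embedding space, a decoder that consumes teacher embeddings can in principle consume student embeddings unchanged, which is the compatibility property the abstract describes.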