This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. As a result, EdgeSAM achieves a 37-fold speed increase compared to the original SAM, and it also outperforms MobileSAM/EfficientSAM, running over 7 times faster when deployed on edge devices while improving mIoU on COCO and LVIS by 2.3/1.5 and 3.1/1.6, respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14. Code and demo are available at https://www.mmlab-ntu.com/project/edgesam.
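The contrast between task-agnostic encoder distillation and prompt-in-the-loop distillation can be sketched as follows. This is a minimal illustrative toy, not the actual EdgeSAM implementation: the linear "encoders", the "decoder", and the prompt embedding are all placeholder assumptions standing in for SAM's ViT encoder, the CNN student, and the frozen prompt encoder/mask decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (assumptions, not the real SAM/EdgeSAM modules):
W_teacher = rng.normal(size=(16, 8))      # frozen ViT-like teacher encoder
W_student = rng.normal(size=(16, 8))      # CNN-like student encoder being distilled
W_decoder = rng.normal(size=(8 + 2, 4))   # frozen decoder over features + prompt

def encode(W, image):
    return image @ W

def decode(features, prompt):
    # A point/box prompt embedding is combined with image features,
    # mimicking how the mask decoder conditions on user input.
    return np.concatenate([features, prompt]) @ W_decoder

image = rng.normal(size=16)
prompt = rng.normal(size=2)  # toy point-prompt embedding

# Task-agnostic distillation: match encoder features only.
f_t, f_s = encode(W_teacher, image), encode(W_student, image)
loss_encoder = np.mean((f_t - f_s) ** 2)

# Prompt-in-the-loop distillation: additionally match the decoded masks,
# so the student's training signal reflects the prompt/mask interaction.
m_t, m_s = decode(f_t, prompt), decode(f_s, prompt)
loss_mask = np.mean((m_t - m_s) ** 2)

loss_total = loss_encoder + loss_mask  # combined objective (loss weights omitted)
```

The key point is that `loss_mask` back-propagates through the frozen decoder with real prompts in the loop, which is what task-agnostic feature matching alone cannot provide.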