We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing its heavy image encoder with EfficientViT. Training proceeds in two stages: we first distill knowledge from the SAM-ViT-H image encoder into EfficientViT, and then train the full model end-to-end on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup on an A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
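The distillation stage can be illustrated with a toy sketch. This is not the released implementation: the linear "teacher" and "student", the mean-squared-error loss on embeddings, the learning rate, and all function names are assumptions chosen only to keep the example dependency-free. It shows a frozen teacher encoder supervising a smaller student encoder through its output embeddings.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_encoder(image):
    # Hypothetical stand-in for the frozen SAM-ViT-H image encoder:
    # a fixed element-wise linear map, never updated during distillation.
    return [2.0 * p + 1.0 for p in image]


class StudentEncoder:
    """Hypothetical stand-in for EfficientViT: one learnable scale and bias."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def __call__(self, image):
        return [self.w * p + self.b for p in image]


def distill_step(student, image, lr=0.1):
    """One gradient-descent step on the embedding MSE; returns the pre-update loss."""
    target = teacher_encoder(image)  # teacher embeddings act as labels
    pred = student(image)
    n = len(image)
    # Analytic gradients of the MSE with respect to the student's parameters.
    grad_w = sum(2.0 * (p - t) * x for p, t, x in zip(pred, target, image)) / n
    grad_b = sum(2.0 * (p - t) for p, t in zip(pred, target)) / n
    student.w -= lr * grad_w
    student.b -= lr * grad_b
    return mse(pred, target)


random.seed(0)
student = StudentEncoder()
losses = []
for _ in range(500):
    # A random 16-value "image" stands in for a real training sample.
    image = [random.uniform(-1.0, 1.0) for _ in range(16)]
    losses.append(distill_step(student, image))
```

In this sketch the student's parameters approach the teacher's map as the loss falls. The second stage described above would then attach the unchanged prompt encoder and mask decoder to the distilled encoder and train the whole model on segmentation data, which this toy does not attempt to model.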