We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. Training proceeds in two stages: we first distill knowledge from the SAM-ViT-H image encoder into EfficientViT, then train end to end on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup on an A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
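
As a rough illustration of the distillation stage, the following toy sketch fits a student encoder to reproduce a frozen teacher's embeddings. Everything in it is an assumption made for illustration: the abstract does not name the distillation loss (a mean squared error is assumed here), and `teacher_encode`, `student_encode`, and `distill` are hypothetical stand-ins, not functions from the released code. The subsequent end-to-end training on SA-1B is not shown.

```python
from typing import List, Sequence


def l2_distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between student and teacher embeddings (assumed loss)."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher embeddings must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def teacher_encode(image: Sequence[float]) -> List[float]:
    """Toy stand-in for the frozen SAM-ViT-H image encoder."""
    return [2.0 * x for x in image]


def student_encode(image: Sequence[float], weight: float) -> List[float]:
    """Toy stand-in for the EfficientViT encoder with one trainable weight."""
    return [weight * x for x in image]


def distill(images: Sequence[Sequence[float]], steps: int = 50, lr: float = 0.05) -> float:
    """Fit the student weight so its embeddings match the teacher's."""
    weight = 0.0
    for _ in range(steps):
        for image in images:
            target = teacher_encode(image)
            output = student_encode(image, weight)
            # Analytic gradient of the mean squared error with respect to `weight`.
            grad = sum(2.0 * (o - t) * x for o, t, x in zip(output, target, image)) / len(image)
            weight -= lr * grad
    return weight


# Hypothetical "images": short feature vectors instead of real pixels.
images = [[1.0, 2.0, 3.0], [0.5, -1.0, 1.5]]
weight = distill(images)
loss = l2_distillation_loss(student_encode(images[0], weight), teacher_encode(images[0]))
```

In this toy setting the student has a single scalar weight, so distillation reduces to recovering the teacher's scale factor. In a real pipeline the same loop would compare image embeddings from the two encoders over batches of training images.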