CLIP and the Segment Anything Model (SAM) are remarkable vision foundation models (VFMs). SAM excels at segmentation across diverse domains, while CLIP is renowned for its zero-shot recognition capabilities. This paper presents an in-depth exploration of integrating these two models into a unified framework. Specifically, we introduce Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, which leverages two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. The former adapts SAM's knowledge into CLIP via distillation and learnable transformer adapters, while the latter transfers CLIP's knowledge into SAM, enhancing its recognition capabilities. Extensive experiments on various datasets and detectors demonstrate the effectiveness of Open-Vocabulary SAM in both segmentation and recognition tasks, where it significantly outperforms naive baselines that simply combine SAM and CLIP. Furthermore, aided by training on image classification data, our method can segment and recognize approximately 22,000 classes.