In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that enables the Segment Anything Model (SAM) to use annotated reference images as segmentation prompts, yielding the VRP-SAM model. In essence, VRP-SAM uses an annotated reference image to identify a specific object and then segments that object in a target image. Notably, the VRP encoder supports a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM extends the versatility and applicability of the SAM framework while preserving SAM's inherent strengths, which makes it more user-friendly. To improve the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization: it can segment unseen objects and supports cross-domain segmentation.