Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models show limited performance on instance-level tasks that demand precise localization and recognition. Previous work has suggested that incorporating visual prompts, such as colored boxes or circles, can improve a model's ability to recognize objects of interest. Nonetheless, compared to language prompting, visual prompt design remains underexplored. Existing approaches, which employ coarse visual cues such as colored boxes or circles, often yield sub-optimal performance because they include irrelevant and noisy pixels. In this paper, we carefully study visual prompt design by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Our investigation reveals that simply blurring the region outside the target mask, referred to as the Blur Reverse Mask, is remarkably effective. This prompting strategy uses the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot referring expression comprehension on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Part detection experiments conducted on the PACO dataset further validate the advantage of FGVP over existing visual prompting techniques. Code and models will be made available.
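To make the Blur Reverse Mask idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a grayscale image stored as nested Python lists, a boolean target mask of the same shape, and a simple box blur standing in for whatever blur kernel the method actually uses. The names `box_blur`, `blur_reverse_mask`, `pick_referred_instance`, and `score_fn` are hypothetical; `score_fn` stands in for an image-text similarity score from a VLM such as CLIP, and the argmax selection over candidate masks is an assumed reading of the zero-shot setup rather than a detail stated in the abstract.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image: rows of pixel intensities
Mask = List[List[bool]]    # True where the pixel belongs to the target


def box_blur(image: Image, radius: int = 1) -> Image:
    """Mean filter over a (2*radius+1)^2 window, clipped at the image borders.

    Assumption: a box blur is used only for simplicity; the abstract does
    not specify the blur kernel.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


def blur_reverse_mask(image: Image, mask: Mask, radius: int = 1) -> Image:
    """Keep pixels inside the mask sharp and blur everything outside it."""
    blurred = box_blur(image, radius)
    return [
        [image[y][x] if mask[y][x] else blurred[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def pick_referred_instance(
    image: Image,
    masks: Sequence[Mask],
    score_fn: Callable[[Image], float],
    radius: int = 1,
) -> int:
    """Return the index of the candidate mask whose prompted image scores highest.

    score_fn is a hypothetical stand-in for the similarity between the
    prompted image and the referring expression.
    """
    scores = [score_fn(blur_reverse_mask(image, m, radius)) for m in masks]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the background is blurred rather than cropped or blacked out, the prompted image keeps the spatial context around the target, which is the property the abstract credits for the method's effectiveness. In practice the nested lists would be replaced by an array library and the candidate masks would come from a segmentation model.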