Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods that rely on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between the two modalities. The method accommodates continually expanding support sets and extends to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary capability.
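To make the fusion step concrete, the sketch below illustrates one way a per-query adapter could fuse text embeddings with visual support prototypes into lightweight per-class classifier weights. It assumes CLIP-style D-dimensional features and PyTorch; the class name, the MLP fusion head, and the masked-average-pooling prototype step are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of learned per-query fusion, assuming CLIP-style features.
# All module and function names here are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerQueryFusionAdapter(nn.Module):
    """Fuses a text embedding with a visual support prototype into one
    classifier weight vector per class, recomputed for each query image."""

    def __init__(self, dim: int):
        super().__init__()
        # Learned fusion instead of a hand-crafted late combination:
        # an MLP maps each concatenated (text, prototype) pair to a weight.
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    @staticmethod
    def masked_average_pool(feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        """Pool support features inside each annotated mask into one prototype.
        feats: (S, D, H, W) support features; masks: (S, H, W) binary masks."""
        masks = masks.unsqueeze(1).float()  # (S, 1, H, W)
        pooled = (feats * masks).sum(dim=(2, 3)) / masks.sum(dim=(2, 3)).clamp(min=1e-6)
        return pooled.mean(dim=0)  # (D,) prototype averaged over S shots

    def forward(self, query_feats, text_embeds, support_feats, support_masks):
        """query_feats: (D, H, W) dense features of the query image;
        text_embeds: (C, D) one embedding per category name;
        support_feats / support_masks: per-class lists of support tensors."""
        weights = []
        for c in range(text_embeds.shape[0]):
            proto = self.masked_average_pool(support_feats[c], support_masks[c])
            w = self.fuse(torch.cat([text_embeds[c], proto], dim=-1))
            weights.append(F.normalize(w, dim=-1))
        W = torch.stack(weights)  # (C, D) per-image classifier
        # Per-pixel logits via cosine similarity with the fused weights.
        q = F.normalize(query_feats, dim=0)
        return torch.einsum("cd,dhw->chw", W, q)  # (C, H, W)
```

Under this sketch, only the small fusion MLP is learned at test time; because the classifier weights are recomputed per query, the support set can grow (or be retrieved per query) without retraining the frozen backbone.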