Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. However, unlike large language models that excel at directly tackling various language tasks, vision foundation models require a task-specific model structure followed by fine-tuning on specific tasks. In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks. Matcher can segment anything by using an in-context example without training. Additionally, we design three effective components within the Matcher framework to collaborate with these foundation models and unleash their full potential in diverse perception tasks. Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ with one example, surpassing the state-of-the-art specialist model by 1.6%. In addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot semantic segmentation, outperforming the state-of-the-art generalist model by 14.4%. Our visualization results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild. Our code can be found at https://github.com/aim-uofa/Matcher.