Powered by large-scale pre-training, vision foundation models exhibit significant potential in open-world image understanding. Although individual models have limited capabilities, properly combining multiple such models can produce positive synergies and unleash their full potential. In this work, we present Matcher, which segments anything with one shot by integrating an all-purpose feature extraction model and a class-agnostic segmentation model. Naively connecting the models results in unsatisfactory performance, e.g., the models tend to generate matching outliers and false-positive mask fragments. To address these issues, we design a bidirectional matching strategy for accurate cross-image semantic dense matching and a robust prompt sampler for mask proposal generation. In addition, we propose a novel instance-level matching strategy for controllable mask merging. The proposed Matcher method delivers impressive generalization performance across various segmentation tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$ for one-shot semantic segmentation, surpassing the state-of-the-art specialist model by 1.6%. In addition, our visualization results show open-world generality and flexibility on images in the wild. The code will be released at https://github.com/aim-uofa/Matcher.
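The abstract names a bidirectional matching strategy for suppressing matching outliers but does not spell out its mechanics. One plausible reading is a forward-reverse consistency check: each reference patch inside the mask is matched to its most similar target patch, that target patch is matched back to the reference image, and the pair is kept only if the reverse match lands inside the reference mask again. The sketch below illustrates that reading on toy feature vectors. The function names (`cosine`, `nearest`, `bidirectional_match`), the use of cosine similarity, and the nearest-neighbor assignment are all assumptions made for illustration, not the released Matcher implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query, candidates):
    """Index of the candidate vector most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def bidirectional_match(ref_feats, ref_mask, tgt_feats):
    """Hypothetical forward-reverse consistency filter (an assumption).

    ref_feats: per-patch feature vectors of the reference image
    ref_mask:  booleans, True where the reference patch lies in the mask
    tgt_feats: per-patch feature vectors of the target image
    Returns (reference_index, target_index) pairs that survive the check.
    """
    matches = []
    for i, inside in enumerate(ref_mask):
        if not inside:
            continue  # only patches inside the reference mask are matched
        j = nearest(ref_feats[i], tgt_feats)   # forward: reference -> target
        k = nearest(tgt_feats[j], ref_feats)   # reverse: target -> reference
        if ref_mask[k]:                        # keep only if we land in the mask
            matches.append((i, j))
    return matches
```

In this sketch a forward match is treated as an outlier when its reverse match falls on a reference patch outside the mask. The surviving target indices could then serve as candidate point prompts for a segmentation model, although how the prompts are actually sampled is not described in the abstract.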