The variety of objects in the real world is nearly unlimited, and it is thus impossible to capture with models trained on a fixed set of categories. As a result, open-vocabulary methods have attracted the interest of the community in recent years. This paper proposes a new method for zero-shot open-vocabulary segmentation. Prior work largely relies on contrastive training using image-text pairs, leveraging grouping mechanisms to learn image features that are both aligned with language and well-localised. This, however, can introduce ambiguity, as the visual appearance of images with similar captions often varies. Instead, we leverage the generative properties of large-scale text-to-image diffusion models to sample a set of support images for a given textual category. This provides a distribution of appearances for a given text, circumventing the ambiguity problem. We further propose a mechanism that considers the contextual background of the sampled images to better localise objects and segment the background directly. We show that our method can be used to ground several existing pre-trained self-supervised feature extractors in natural language and to provide explainable predictions by mapping back to regions in the support set. Our proposal is training-free, relying on pre-trained components only, yet shows strong performance on a range of open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on the Pascal VOC benchmark.
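To make the support-set idea more concrete, the following is a minimal sketch of one way pixels could be labelled by comparing their features with prototypes derived from sampled support images. It is an illustration under stated assumptions, not the implementation described in the paper: the function names, the use of cosine similarity, and the notion of one averaged prototype vector per foreground or background region are all hypothetical, and the features are assumed to come from some pre-trained extractor that is not shown here.

```python
import numpy as np


def l2_normalise(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def segment_by_nearest_prototype(pixel_feats, prototypes, proto_labels):
    """Assign each pixel the label of its most similar prototype.

    Assumptions (hypothetical, for illustration only):
      pixel_feats  : (H, W, D) dense features of the query image.
      prototypes   : (K, D) feature vectors summarising regions of the
                     sampled support images (foreground and background).
      proto_labels : (K,) integer label of each prototype, e.g. 0 for
                     background and 1..C for the textual categories.
    """
    h, w, d = pixel_feats.shape
    feats = l2_normalise(pixel_feats.reshape(-1, d))
    protos = l2_normalise(prototypes)
    # Cosine similarity between every pixel and every prototype: (H*W, K).
    similarity = feats @ protos.T
    nearest = similarity.argmax(axis=1)
    return np.asarray(proto_labels)[nearest].reshape(h, w)


# Toy usage: two prototypes (background, "cat") in a 2-D feature space.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
proto_labels = np.array([0, 1])
pixel_feats = np.array([[[0.9, 0.1], [0.2, 0.8]],
                        [[1.0, 0.0], [0.0, 2.0]]])
mask = segment_by_nearest_prototype(pixel_feats, prototypes, proto_labels)
```

Because each prediction is tied to a specific prototype index, the index in `nearest` could in principle be traced back to the support region that produced it, which is the kind of explainability the abstract alludes to.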