Foundation models have shown unprecedented capabilities across many domains and tasks. Models such as CLIP are widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading approach to realistic image generation. Image generative models are trained on massive datasets, which gives them powerful internal spatial representations. In this work, we explore the potential benefits of such representations beyond image generation, in particular for dense visual prediction tasks. We focus on image segmentation, which is traditionally solved by training models on closed-vocabulary datasets with pixel-level annotations. To avoid both the annotation cost and the cost of training large diffusion models, we constrain our setup to be zero-shot and training-free. In a nutshell, our pipeline leverages several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation. The pipeline is as follows: the image is passed to both a captioner model (i.e., BLIP) and a diffusion model (i.e., Stable Diffusion) to generate a text description and a visual representation, respectively. The diffusion features are clustered and binarized to obtain a class-agnostic mask for each object. These masks are then mapped to textual classes using the CLIP model, which supports an open vocabulary. Finally, we add a refinement step that yields more precise segmentation masks. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both the Pascal VOC and COCO datasets. In addition, we show very competitive results compared to recent weakly-supervised segmentation approaches. We provide comprehensive experiments showing the superiority of diffusion model features over those of other pretrained models. Project page: https://bcorrad.github.io/freesegdiff/
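The cluster, binarize, and label steps of the pipeline can be illustrated with a minimal NumPy sketch. This is not the FreeSeg-Diff implementation: the feature map is synthetic rather than extracted from Stable Diffusion, the text embeddings are hand-written stand-ins for CLIP text features of caption-derived class names, each mask is embedded by mean-pooling per-pixel features (a real system would more plausibly pass each masked region through a CLIP image encoder), and the clustering is plain k-means with a farthest-point initialization chosen here for determinism. The function names `kmeans`, `masks_from_features`, and `label_masks` are hypothetical, and the refinement step is omitted.

```python
import numpy as np


def kmeans(x, k, iters=10):
    """Lloyd's k-means on the rows of x with farthest-point initialization."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest existing center.
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels


def masks_from_features(feats, k):
    """Cluster an (H, W, C) feature map; return k boolean (H, W) masks."""
    h, w, c = feats.shape
    labels = kmeans(feats.reshape(h * w, c), k).reshape(h, w)
    return [labels == j for j in range(k)]  # binarize: one mask per cluster


def label_masks(masks, pixel_emb, text_emb, class_names):
    """Give each mask the class whose text embedding is most cosine-similar
    to the mask's mean-pooled embedding (assumed stand-in for CLIP)."""
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    names = []
    for m in masks:
        if not m.any():  # skip empty clusters
            names.append(None)
            continue
        v = pixel_emb[m].mean(0)
        v = v / np.linalg.norm(v)
        names.append(class_names[int((text @ v).argmax())])
    return names


# Synthetic example: an 8x8 "image" whose left and right halves differ.
rng = np.random.default_rng(0)
feats = np.zeros((8, 8, 4))
feats[:, :4, 0] = 1.0  # left half
feats[:, 4:, 1] = 1.0  # right half
feats += 0.05 * rng.standard_normal(feats.shape)

class_names = ["cat", "grass"]  # hypothetical nouns from a caption
text_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])  # made-up text embeddings

masks = masks_from_features(feats, k=2)
labels = label_masks(masks, feats, text_emb, class_names)
print(labels)
```

In this toy setting the two clusters recover the left and right halves of the feature map, and each mask is matched to the class whose embedding it resembles most. With real models, the number of clusters and the candidate class names would have to come from the image and its caption rather than being fixed by hand.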