FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Foundation models have exhibited unprecedented capabilities across many domains and tasks. Models such as CLIP are widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models for realistic image generation. Image generative models are trained on massive datasets, which gives them powerful internal spatial representations. In this work, we explore the potential benefits of such representations beyond image generation, in particular for dense visual prediction tasks. We focus on image segmentation, which is traditionally solved by training models on closed-vocabulary datasets with pixel-level annotations. To avoid the cost of annotation and of training large diffusion models, we constrain our setup to be zero-shot and training-free. In a nutshell, our pipeline leverages several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation. The pipeline is as follows: the image is passed to both a captioner model (i.e., BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text description and a visual representation, respectively. The features are clustered and binarized to obtain class-agnostic masks for each object. These masks are then mapped to a textual class using the CLIP model, which supports an open vocabulary. Finally, we add a refinement step that yields a more precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both the Pascal VOC and COCO datasets. In addition, we show very competitive results compared to recent weakly-supervised segmentation approaches. We provide comprehensive experiments showing the superiority of diffusion model features over those of other pretrained models. Project page: https://bcorrad.github.io/freesegdiff/
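The clustering, binarization, and mask-to-class steps described above can be illustrated with a toy sketch. This is not the authors' implementation: the feature map is synthetic rather than taken from a diffusion model, the "embeddings" are hand-made stand-ins for CLIP outputs, k-means with a farthest-point initialization is an assumed choice of clustering algorithm, and the function names `cluster_masks` and `assign_labels` are hypothetical. The captioning and refinement steps are omitted.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far (an assumed choice)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    return np.stack(centers).astype(float)

def cluster_masks(features, k, iters=10):
    """Cluster an (H, W, C) feature map with k-means and binarize the
    result into k boolean (H, W) class-agnostic masks."""
    h, w, c = features.shape
    points = features.reshape(-1, c)
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        # Squared distance from every pixel feature to every center: (H*W, k)
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    label_map = labels.reshape(h, w)
    return [label_map == j for j in range(k)]

def assign_labels(mask_embeds, text_embeds, class_names):
    """Give each mask the class whose text embedding has the highest
    cosine similarity with the mask's embedding."""
    a = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    b = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = a @ b.T  # (num_masks, num_classes)
    return [class_names[i] for i in sims.argmax(axis=1)]

# Synthetic 8x8 feature map: left half and right half have distinct features.
rng = np.random.default_rng(0)
feats = np.zeros((8, 8, 4))
feats[:, :4] = [1.0, 0.0, 0.0, 0.0]
feats[:, 4:] = [0.0, 1.0, 0.0, 0.0]
feats += 0.01 * rng.standard_normal(feats.shape)

masks = cluster_masks(feats, k=2)

# Toy stand-ins: a real pipeline would embed each masked region with CLIP's
# image encoder and embed candidate class names with its text encoder.
mask_embeds = np.stack([feats[m].mean(axis=0) for m in masks])
class_names = ["cat", "dog"]
text_embeds = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])

labels = assign_labels(mask_embeds, text_embeds, class_names)
```

In this sketch the candidate class names are fixed by hand; in the pipeline described above they would presumably be derived from the generated caption, which is what makes the vocabulary open.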

