Unleashing Text-to-Image Diffusion Models for Visual Perception

Diffusion models (DMs) have become the new trend in generative modeling and have demonstrated a powerful capacity for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to better alignment with the pre-training stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
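The two headline numbers above are reported as RMSE (depth estimation) and oIoU (referring image segmentation). As a reading aid, the sketch below shows how these metrics are commonly defined: RMSE as the root of the mean squared per-pixel depth error, and oIoU (overall IoU) as the total intersection divided by the total union accumulated over all samples. This is not the authors' evaluation code; the function names `rmse` and `overall_iou` and the flat-list data layout are assumptions made for illustration, and benchmark protocols may add details (such as valid-pixel masks or depth caps) that are omitted here.

```python
import math
from typing import Sequence


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error over paired depth values (hypothetical helper)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared_error / len(pred))


def overall_iou(pred_masks: Sequence[Sequence[int]],
                gt_masks: Sequence[Sequence[int]]) -> float:
    """Overall IoU: intersections and unions are summed over ALL samples
    before dividing, so large objects weigh more than in a per-sample mean."""
    if len(pred_masks) != len(gt_masks):
        raise ValueError("need one ground-truth mask per predicted mask")
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        if len(pred) != len(gt):
            raise ValueError("each mask pair must have the same number of pixels")
        for p, g in zip(pred, gt):
            total_inter += int(bool(p) and bool(g))
            total_union += int(bool(p) or bool(g))
    # Assumption: define the score as 0.0 when every mask is empty.
    return total_inter / total_union if total_union else 0.0


if __name__ == "__main__":
    # Toy data: flattened binary masks and depth values in meters.
    print(rmse([1.0, 2.0], [1.0, 4.0]))
    print(overall_iou([[1, 1, 0, 0], [1, 1, 1, 1]],
                      [[1, 0, 0, 0], [1, 1, 1, 1]]))
```

Because oIoU pools pixels across the whole split, it generally differs from the mean of per-sample IoU scores; in the toy data above the two samples have IoU 0.5 and 1.0, yet the pooled score is 5/6.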

