Diffusion models (DMs) have become a prominent class of generative models and have demonstrated a powerful capacity for conditional synthesis. Among them, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable through customizable prompts. Unlike unconditional generative models, which focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to vision-language pre-training. In this paper, we propose VPD (Visual Perception with a pre-trained Diffusion model), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, which leads to better alignment with the pre-training stage and lets the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be adapted faster to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD
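The two mechanisms named in the abstract, refining text features with an adapter and reading cross-attention maps between visual and text features, can be illustrated with a toy sketch. This is not the released VPD code: the residual form of the adapter, the two-layer MLP, the scale `gamma`, the scaled dot-product attention, and every function name below are assumptions chosen for illustration, and random vectors stand in for real text-encoder and denoising-network features.

```python
import math
import random


def linear(x, w, b):
    """Apply y = W x + b to a single vector x (W is a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def adapt_text(text_feats, w1, b1, w2, b2, gamma):
    """Hypothetical residual adapter: t + gamma * MLP(t) for each text token t."""
    out = []
    for t in text_feats:
        hidden = [max(0.0, h) for h in linear(t, w1, b1)]  # ReLU
        delta = linear(hidden, w2, b2)
        out.append([ti + gamma * di for ti, di in zip(t, delta)])
    return out


def cross_attention_maps(visual_feats, text_feats):
    """For each visual location, a softmax over its similarity to each text token."""
    dim = len(text_feats[0])
    maps = []
    for v in visual_feats:
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(dim) for t in text_feats]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        maps.append([e / total for e in exps])
    return maps


def random_matrix(rows, cols, rng):
    """Small random values standing in for learned weights or extracted features."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim, num_tokens, num_pixels = 8, 16, 3, 4
text = random_matrix(num_tokens, dim, rng)      # stand-in for text-encoder output
visual = random_matrix(num_pixels, dim, rng)    # stand-in for flattened visual features
w1, b1 = random_matrix(hidden_dim, dim, rng), [0.0] * hidden_dim
w2, b2 = random_matrix(dim, hidden_dim, rng), [0.0] * dim

refined = adapt_text(text, w1, b1, w2, b2, gamma=1e-4)
attn = cross_attention_maps(visual, refined)    # shape: num_pixels x num_tokens
```

With a small `gamma`, the refined text features start close to the original ones, so the adapter perturbs the pre-trained text conditioning only slightly at the beginning of fine-tuning. Each row of `attn` is a distribution over text tokens for one visual location, which is the kind of signal that could serve as explicit guidance for a downstream prediction head.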