Text-to-image diffusion models have shown a strong capability for conditional image synthesis. With large-scale vision-language pre-training, they can generate high-quality images with rich texture and plausible structure from diverse text prompts. However, how to adapt a pre-trained diffusion model to visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition the feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights between the two branches, so that the implicit and explicit branches jointly guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. We conduct experiments on two typical perception tasks: semantic segmentation and depth estimation. IEDP achieves promising performance on both tasks. For semantic segmentation, IEDP achieves an mIoU of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms the baseline method VPD with a relative gain of 10.2%.
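The training and inference flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function and class name below is hypothetical, the "backbone" is a two-parameter linear model rather than a diffusion network, and the encoders are simple arithmetic stand-ins for the frozen CLIP image encoder and the label-based text prompt. It is meant only to show the control flow, namely two conditioning paths passing through one shared set of weights during training, with only the implicit path used at inference.

```python
# Toy illustration only. All names are hypothetical; nothing here is a
# diffusion model or CLIP. It shows the control flow, not the method's details.

def implicit_embedding(image):
    """Stand-in for a frozen image encoder that maps an image to
    text-like embeddings. It has no trainable parameters."""
    mean = sum(image) / len(image)
    return [mean, mean * 0.5]


def explicit_embedding(label_names, vocab):
    """Stand-in for encoding ground-truth label names as a text prompt."""
    return [float(vocab[name]) for name in label_names]


class SharedBackbone:
    """One set of weights (a scalar weight and a bias) used by both branches."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def forward(self, image, cond):
        # Condition the image feature on the embedding by simple addition.
        feat = sum(image) / len(image) + sum(cond) / len(cond)
        return self.w * feat + self.b, feat


def train_step(model, image, label_names, target, vocab, lr=0.01):
    """Joint step: both branches run through the same weights, and their
    squared-error losses are summed before a single gradient update."""
    total_loss = 0.0
    grad_w = 0.0
    grad_b = 0.0
    conditions = (
        implicit_embedding(image),                # implicit branch
        explicit_embedding(label_names, vocab),   # explicit branch
    )
    for cond in conditions:
        pred, feat = model.forward(image, cond)
        err = pred - target
        total_loss += err ** 2
        grad_w += 2 * err * feat
        grad_b += 2 * err
    model.w -= lr * grad_w
    model.b -= lr * grad_b
    return total_loss


def predict(model, image):
    """Inference uses only the implicit branch, so no labels are needed."""
    pred, _ = model.forward(image, implicit_embedding(image))
    return pred


if __name__ == "__main__":
    vocab = {"sky": 1, "tree": 3}          # hypothetical label vocabulary
    image = [1.0, 2.0, 3.0]                # hypothetical flattened "image"
    model = SharedBackbone()
    for _ in range(50):
        loss = train_step(model, image, ["sky", "tree"], 2.0, vocab)
    print("final joint loss:", loss)
    print("label-free prediction:", predict(model, image))
```

Because both branches update the same `SharedBackbone` parameters, the explicit branch influences the weights that the implicit branch later uses alone at inference, which is the property the abstract relies on.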