Text-to-image diffusion models have shown strong capability in conditional image synthesis. With large-scale vision-language pre-training, they can generate high-quality images with rich texture and plausible structure from diverse text prompts. However, how to adapt a pre-trained diffusion model for visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition the feature extraction of the diffusion model. During training, the two branches share model weights, so the diffusion model is trained jointly and both branches guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. We conduct experiments on two typical perception tasks, semantic segmentation and depth estimation, and IEDP achieves promising performance on both. For semantic segmentation, IEDP achieves an mIoU$^\text{ss}$ score of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms the baseline method VPD with a relative gain of 11.0%.
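The two-branch training scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scalar linear "backbone", the halving function standing in for a frozen CLIP image encoder, the one-hot label embedding, and every name below (`SharedBackbone`, `implicit_condition`, `explicit_condition`, `joint_loss`, `train_step`, `infer`) are hypothetical stand-ins chosen only to show one set of weights receiving gradients from both branches, with inference using the implicit branch alone.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size (assumption, not from the paper)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class SharedBackbone:
    """Toy stand-in for the diffusion feature extractor.

    Both branches call the same instance, so they share weights.
    """

    def __init__(self):
        self.w_img = [random.uniform(-0.5, 0.5) for _ in range(DIM)]
        self.w_cond = [random.uniform(-0.5, 0.5) for _ in range(DIM)]

    def predict(self, image_feat, cond):
        # Scalar prediction conditioned on a language-like embedding.
        return dot(self.w_img, image_feat) + dot(self.w_cond, cond)


def implicit_condition(image_feat):
    # Hypothetical frozen image encoder: fixed mapping, never updated.
    return [0.5 * v for v in image_feat]


def explicit_condition(label_id):
    # Hypothetical text embedding of the ground-truth label (one-hot).
    return [1.0 if i == label_id else 0.0 for i in range(DIM)]


def joint_loss(model, image_feat, label_id, target):
    # Sum of squared errors from the implicit and explicit branches.
    pred_imp = model.predict(image_feat, implicit_condition(image_feat))
    pred_exp = model.predict(image_feat, explicit_condition(label_id))
    return (pred_imp - target) ** 2 + (pred_exp - target) ** 2


def train_step(model, image_feat, label_id, target, lr=0.05):
    # One gradient-descent step on joint_loss w.r.t. the shared weights.
    c_imp = implicit_condition(image_feat)
    c_exp = explicit_condition(label_id)
    e_imp = model.predict(image_feat, c_imp) - target
    e_exp = model.predict(image_feat, c_exp) - target
    for i in range(DIM):
        g_img = 2 * (e_imp + e_exp) * image_feat[i]
        g_cond = 2 * (e_imp * c_imp[i] + e_exp * c_exp[i])
        model.w_img[i] -= lr * g_img
        model.w_cond[i] -= lr * g_cond


def infer(model, image_feat):
    # Inference uses only the implicit branch: no ground-truth label needed.
    return model.predict(image_feat, implicit_condition(image_feat))


model = SharedBackbone()
x, label, target = [1.0, 0.5, -0.5, 0.2], 2, 1.0
loss_before = joint_loss(model, x, label, target)
for _ in range(200):
    train_step(model, x, label, target)
loss_after = joint_loss(model, x, label, target)
print(loss_before, loss_after, infer(model, x))
```

In a real system the shared module would presumably be a pre-trained diffusion denoiser and the conditioning vectors would come from CLIP encoders; the sketch only mirrors the control flow: two conditioning paths, one shared set of weights, a summed loss during training, and label-free inference.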