Implicit and Explicit Language Guidance for Diffusion-based Visual Perception

Text-to-image diffusion models have shown a powerful ability for conditional image synthesis. With large-scale vision-language pre-training, diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts. However, adapting a pre-trained diffusion model to visual perception remains an open problem. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch uses the ground-truth labels of the corresponding images as text prompts to condition the feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights between the two branches, so that the implicit and explicit branches jointly guide feature learning. During inference, we employ only the implicit branch for the final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. IEDP achieves promising performance on both tasks. For semantic segmentation, IEDP achieves an mIoU$^\text{ss}$ score of 55.9% on the ADE20K validation set, outperforming the baseline method VPD by 2.2%. For depth estimation, IEDP outperforms the baseline method VPD with a relative gain of 11.0%.
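The training and inference flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the diffusion feature extractor is replaced by a single shared weight vector, and the frozen CLIP image encoder and the label-prompt encoder are replaced by simple deterministic functions. All names (`SharedBackbone`, `implicit_condition`, `explicit_condition`, `train_step`, `infer`) are hypothetical and chosen only for illustration. The sketch shows only the structure: two conditioning branches, one set of shared weights updated by both losses, and label-free inference through the implicit branch.

```python
import random


class SharedBackbone:
    """Toy stand-in for the diffusion feature extractor (assumption).

    A single weight vector is shared by the implicit and explicit branches.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def predict(self, image_feat, cond):
        # Toy conditioned prediction: image features modulated by the conditioning.
        return sum(w * x * c for w, x, c in zip(self.w, image_feat, cond))

    def grad_and_loss(self, image_feat, cond, target):
        # Gradient of the squared error 0.5 * (pred - target)^2 with respect to w.
        err = self.predict(image_feat, cond) - target
        grad = [err * x * c for x, c in zip(image_feat, cond)]
        return grad, 0.5 * err * err


def implicit_condition(image_feat):
    # Hypothetical stand-in for the frozen image encoder: depends on the image
    # only and has no trainable parameters.
    return [1.0 + 0.5 * x for x in image_feat]


def explicit_condition(label_names, dim):
    # Hypothetical stand-in for encoding ground-truth label names as a prompt.
    cond = [1.0] * dim
    for name in label_names:
        cond[sum(map(ord, name)) % dim] += 0.5
    return cond


def train_step(backbone, image_feat, label_names, target, lr=0.1):
    """One joint update: both branches contribute gradients to the same weights."""
    dim = len(image_feat)
    g_imp, loss_imp = backbone.grad_and_loss(
        image_feat, implicit_condition(image_feat), target)
    g_exp, loss_exp = backbone.grad_and_loss(
        image_feat, explicit_condition(label_names, dim), target)
    backbone.w = [w - lr * (gi + ge)
                  for w, gi, ge in zip(backbone.w, g_imp, g_exp)]
    return loss_imp + loss_exp


def infer(backbone, image_feat):
    # Inference uses only the implicit branch, so no labels are needed.
    return backbone.predict(image_feat, implicit_condition(image_feat))
```

In this sketch, calling `train_step` repeatedly with an image feature vector, its label names, and a target updates one weight vector from both losses, while `infer` takes only the image features. A real system would replace each stand-in with the corresponding pre-trained network.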
