Generative pretraining for vision models has long been an open problem. Text-to-image (T2I) diffusion models can now generate high-definition images that match textual inputs, a capability made possible by pre-training on large-scale image-text pairs. This raises a natural question: can diffusion models be used for visual perception tasks? In this paper, we propose a simple yet effective scheme for harnessing a diffusion model for visual perception. Our key insight is to introduce learnable embeddings (meta prompts) into the pre-trained diffusion model to extract features suited to perception. The effect of the meta prompts is two-fold. First, as a direct replacement for the text embeddings in T2I models, they activate task-relevant features during feature extraction. Second, they are used to re-arrange the extracted features so that the model focuses on the features most pertinent to the task at hand. Additionally, we design a recurrent refinement training strategy that fully leverages the property of diffusion models, thereby yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our approach sets new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes. The proposed method also attains results comparable to the current state of the art in semantic segmentation on ADE20K and pose estimation on COCO, further demonstrating its robustness and versatility.
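The two roles of the meta prompts can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the abstract does not specify the architecture, so `toy_feature_extractor` is a hypothetical single cross-attention layer standing in for the pre-trained diffusion network, `rearrange_features` uses a plain dot product as one possible form of feature re-arrangement, and the recurrent refinement is loosely modelled as feeding the extracted features back into the extractor for a few steps. The sizes and function names are likewise hypothetical, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
NUM_PROMPTS, DIM = 8, 16
H, W = 4, 4

# Meta prompts: learnable embeddings that take the place of text embeddings.
# In a real system they would be trained; here they are only initialised.
meta_prompts = rng.normal(size=(NUM_PROMPTS, DIM))


def toy_feature_extractor(feats, prompts):
    """Stand-in for the pre-trained diffusion network (assumption).

    A single residual cross-attention layer: each spatial feature attends
    to the prompts, which is where text embeddings would normally enter.
    feats: (H*W, DIM), prompts: (N, DIM) -> (H*W, DIM)
    """
    scores = feats @ prompts.T / np.sqrt(prompts.shape[1])  # (H*W, N)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)           # softmax over prompts
    return feats + attn @ prompts


def rearrange_features(feats, prompts):
    """One possible re-arrangement (assumption): project every spatial
    feature onto every prompt, giving one response map per prompt."""
    return feats @ prompts.T                                # (H*W, N)


def extract(feats, prompts, steps=3):
    """Loose model of recurrent refinement: feed the features back into
    the extractor `steps` times, then re-arrange them with the prompts."""
    for _ in range(steps):
        feats = toy_feature_extractor(feats, prompts)
    return rearrange_features(feats, prompts)


image_feats = rng.normal(size=(H * W, DIM))
task_feats = extract(image_feats, meta_prompts)
print(task_feats.shape)  # (16, 8)
```

In this sketch the same `meta_prompts` array is used twice, once as the conditioning input during extraction and once as the projection basis afterwards, which mirrors the two-fold effect described in the abstract. A task-specific head (for depth, segmentation, or pose) would consume `task_feats`.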