The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.
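The key idea above, scoring each candidate label by how well a text-conditioned model denoises a noised copy of the image, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for a real diffusion model such as Imagen, and the prompts, noise levels, sample count, and unweighted mean-squared-error score are not taken from the paper, whose actual scoring and weighting may differ.

```python
from functools import partial

import numpy as np


def toy_denoiser(noisy, noise_level, prompt, prototypes):
    """Hypothetical stand-in for a text-conditioned diffusion model.

    Predicts the clean image as a blend of the noisy input and a
    per-prompt prototype; more noise means more reliance on the prompt.
    """
    return (1.0 - noise_level) * noisy + noise_level * prototypes[prompt]


def zero_shot_classify(image, prompts, denoise_fn,
                       noise_levels=(0.3, 0.5, 0.7), n_samples=8, seed=0):
    """Pick the prompt whose conditioning gives the lowest denoising error."""
    rng = np.random.default_rng(seed)
    scores = {p: 0.0 for p in prompts}
    for sigma in noise_levels:
        for _ in range(n_samples):
            eps = rng.standard_normal(image.shape)
            # Assumed variance-preserving noising of the input image.
            noisy = np.sqrt(1.0 - sigma ** 2) * image + sigma * eps
            # Reuse the same noisy image for every prompt so that the
            # scores are directly comparable.
            for p in prompts:
                pred = denoise_fn(noisy, sigma, p)
                # Lower reconstruction error -> higher score for the label.
                scores[p] -= float(np.mean((pred - image) ** 2))
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: two hypothetical class prototypes and an image near the first.
prototypes = {
    "a photo of a cat": np.ones((8, 8)),
    "a photo of a dog": -np.ones((8, 8)),
}
image = prototypes["a photo of a cat"] + 0.05 * np.random.default_rng(1).standard_normal((8, 8))

label, scores = zero_shot_classify(
    image, list(prototypes), partial(toy_denoiser, prototypes=prototypes)
)
print(label, scores)
```

In this toy setup the denoising error acts as a proxy for label likelihood: the prompt that best explains the image yields the smallest reconstruction error. With a real diffusion model, `denoise_fn` would wrap a call to the model conditioned on the prompt text, at a much higher compute cost per candidate label.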