Diffusion models, such as Stable Diffusion, have shown strong performance on text-to-image generation. Because text-to-image generation often requires models to render visual concepts with the fine-grained details and attributes specified in text prompts, can we leverage the representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose Discriminative Stable Diffusion (DSD), an approach that turns pre-trained text-to-image diffusion models into few-shot discriminative learners. DSD uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model via efficient attention-based prompt learning to perform image-text matching. Comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
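To make the idea of scoring an image-text pair from cross-attention more concrete, the following is a minimal sketch, not the DSD implementation. It assumes image patches act as attention queries and text tokens as keys, computes scaled dot-product attention logits between them, and pools the logits into one scalar per caption. The function names (`attention_logits`, `matching_score`, `best_caption`), the max-then-mean pooling rule, and the toy 2-dimensional features are all hypothetical choices made for illustration; the abstract does not specify how the attention scores are pooled, and a real system would read these scores from the cross-attention layers of a Stable Diffusion U-Net.

```python
import math


def attention_logits(patches, tokens):
    """Scaled dot-product logits between image patches (queries) and
    text tokens (keys). Returns a len(patches) x len(tokens) matrix."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [
        [scale * sum(p * t for p, t in zip(patch, token)) for token in tokens]
        for patch in patches
    ]


def matching_score(patches, tokens):
    """Pool the logits into a single image-text score.

    Assumption (illustrative only): for each text token take its
    best-matching patch, then average over tokens."""
    logits = attention_logits(patches, tokens)
    per_token_best = [
        max(row[j] for row in logits) for j in range(len(tokens))
    ]
    return sum(per_token_best) / len(per_token_best)


def best_caption(patches, captions):
    """Return the index of the caption with the highest matching score."""
    scores = [matching_score(patches, caption) for caption in captions]
    return max(range(len(captions)), key=scores.__getitem__)


# Toy features: two image patches and two candidate captions,
# each caption given as two token vectors.
image_patches = [[1.0, 0.0], [0.0, 1.0]]
caption_a = [[1.0, 0.0], [0.0, 1.0]]      # aligned with the patches
caption_b = [[-1.0, 0.0], [0.0, -1.0]]    # opposed to the patches

score_a = matching_score(image_patches, caption_a)
score_b = matching_score(image_patches, caption_b)
chosen = best_caption(image_patches, [caption_b, caption_a])
```

In this toy setup the aligned caption receives the higher score, so `best_caption` selects it. In a few-shot setting, the pooled score would be the quantity trained against matching labels, with the learned parameters confined to a small set of prompt or attention weights rather than the whole model.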