Despite advancements in text-to-image (T2I) generation, prior methods often suffer from text-image misalignment, such as relation confusion in generated images. Existing solutions manipulate cross-attention for better compositional understanding or integrate large language models for improved layout planning. However, the inherent alignment capabilities of T2I models remain inadequate. By revisiting the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-image alignment during generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks, and we leverage discriminative fine-tuning to improve their text-image alignment. As an added benefit of the discriminative adapter, a self-correction mechanism can use discriminative gradients to better align generated images with text prompts during inference. Comprehensive evaluations across three benchmark datasets, covering both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.