Diffusion models have achieved great success in text-to-image generation. However, alleviating the misalignment between text prompts and generated images remains challenging, and the root cause of this misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which stems from its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. We also propose a novel attribute concentration module to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL and obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.
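To make the image-to-text concept matching idea more concrete, the following is a minimal sketch of one plausible way to score alignment with a captioning model. It assumes the captioner exposes, for each prompt token, a vector of vocabulary logits conditioned on the generated image, and it uses the negative mean log-likelihood of the prompt tokens as the score. The abstract does not specify the exact objective, so this loss form and the names `log_softmax` and `concept_matching_loss` are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def concept_matching_loss(step_logits, prompt_token_ids):
    """Negative mean log-likelihood of the prompt tokens under a captioner.

    step_logits: one list of vocabulary logits per prompt token, assumed to
        come from a captioning model conditioned on the generated image
        (hypothetical interface).
    prompt_token_ids: index of the ground-truth prompt token at each step.
    A lower value means the captioner finds the prompt more likely given the
    image, i.e. fewer prompt concepts were ignored.
    """
    if not prompt_token_ids:
        raise ValueError("prompt must contain at least one token")
    if len(step_logits) != len(prompt_token_ids):
        raise ValueError("need exactly one logit vector per prompt token")
    total = 0.0
    for logits, token_id in zip(step_logits, prompt_token_ids):
        total += log_softmax(logits)[token_id]
    return -total / len(prompt_token_ids)


if __name__ == "__main__":
    # Toy 3-word vocabulary; the captioner strongly predicts token 0.
    logits = [[5.0, 0.0, 0.0]]
    print(concept_matching_loss(logits, [0]))  # prompt token is "seen" in the image
    print(concept_matching_loss(logits, [1]))  # prompt token is missing from the image
```

In a fine-tuning loop, a score of this kind would be computed on images produced by the diffusion model and backpropagated through the generation process, so that prompt tokens the captioner cannot recover from the image are penalized.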