Diffusion models have achieved great success in text-to-image generation. However, alleviating the misalignment between text prompts and generated images remains challenging, and the root cause of this misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation. We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm. To address the issue, we propose CoMat, an end-to-end diffusion model fine-tuning strategy with an image-to-text concept matching mechanism. We leverage an image captioning model to measure image-to-text alignment and guide the diffusion model to revisit ignored tokens. We also propose a novel attribute concentration module to address the attribute binding problem. Without any image or human preference data, we use only 20K text prompts to fine-tune SDXL and obtain CoMat-SDXL. Extensive experiments show that CoMat-SDXL significantly outperforms the baseline model SDXL on two text-to-image alignment benchmarks and achieves state-of-the-art performance.
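The idea of using a captioning model to measure image-to-text alignment can be illustrated with a minimal sketch. It assumes, hypothetically, that a captioner returns one log-probability per prompt token given the generated image; the function names, the example prompt, and the probabilities below are invented for illustration and are not taken from the CoMat implementation. Under that assumption, a low mean log-likelihood signals poor alignment, and the lowest-scoring tokens are candidates for concepts the image ignores.

```python
import math
from typing import List, Sequence


def caption_alignment_loss(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the prompt tokens.

    Assumption: token_log_probs[i] is the log-probability a captioning
    model assigns to prompt token i given the generated image.
    """
    if not token_log_probs:
        raise ValueError("prompt must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def least_supported_tokens(
    tokens: Sequence[str], token_log_probs: Sequence[float], k: int = 1
) -> List[str]:
    """Return the k tokens with the lowest log-probability."""
    if len(tokens) != len(token_log_probs):
        raise ValueError("tokens and log-probabilities must align")
    ranked = sorted(zip(token_log_probs, tokens))  # ascending log-prob
    return [tok for _, tok in ranked[:k]]


# Hypothetical captioner output for an image in which the cube is not red.
tokens = ["a", "red", "cube", "on", "a", "blue", "sphere"]
probs = [0.9, 0.2, 0.8, 0.9, 0.9, 0.7, 0.8]  # made-up values
log_probs = [math.log(p) for p in probs]

loss = caption_alignment_loss(log_probs)
ignored = least_supported_tokens(tokens, log_probs, k=1)
print(f"alignment loss: {loss:.3f}, least supported token: {ignored}")
```

In an actual fine-tuning setup, a loss of this kind would have to be differentiable with respect to the generated image so that gradients can reach the diffusion model; the plain-Python version here only shows the scoring logic.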