In text-to-image generation, advances in diffusion models have improved the fidelity of generated results. However, these models struggle with text prompts that contain multiple entities and attributes: an uneven distribution of attention leads to entity leakage and attribute misalignment. Addressing this by training from scratch requires large amounts of labeled data and is resource-intensive. Motivated by this, we propose an attribution-focusing mechanism, a training-free, phase-wise mechanism that modulates attention in diffusion models. One of our core ideas is to guide the model to concentrate on the corresponding syntactic components of the prompt at distinct timesteps. To achieve this, we apply a temperature control mechanism to the self-attention modules during the early phases to mitigate entity leakage. We further integrate an object-focused masking scheme and a phase-wise dynamic weight control mechanism into the cross-attention modules, enabling the model to discern more effectively which semantic information belongs to which entity. Experimental results in various alignment scenarios demonstrate that our method attains better image-text alignment at minimal additional computational cost.
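As a rough illustration of the temperature control idea, the sketch below divides attention scores by a temperature before the softmax and lowers that temperature only in the early denoising steps. The abstract does not give the exact formulation, so the function names, the direction of the adjustment (a temperature below 1 to sharpen attention), the `early_fraction` boundary, and the `early_temp` value are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over a list of scores, each divided by a temperature first."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def phase_temperature(step, num_steps, early_fraction=0.3, early_temp=0.5):
    """Hypothetical schedule: a lower temperature during the early phase only.

    early_fraction and early_temp are illustrative values, not from the paper.
    """
    return early_temp if step < early_fraction * num_steps else 1.0


def attention_weights(query, keys, temperature=1.0):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores, temperature)


# Usage: compare attention in an early step and a late step of a 50-step run.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
early = attention_weights(query, keys, phase_temperature(0, 50))
late = attention_weights(query, keys, phase_temperature(40, 50))
```

Under these assumptions, the early-phase weights are more concentrated on the best-matching key than the late-phase weights, which is the kind of sharpening that could limit mixing between regions belonging to different entities.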