Vanilla text-to-image diffusion models struggle to generate accurate human images, commonly producing anatomical flaws such as unnatural postures or disproportionate limbs. Existing methods mostly address this issue by fine-tuning the model on extra images or by adding controls at the image generation phase in the form of human-centric priors such as pose or depth maps. This paper explores integrating these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at inference. We realize this idea by proposing a human-centric alignment loss that strengthens human-related information from the textual prompts within the cross-attention maps. To ensure both rich semantic detail and accurate human structure during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, based on an in-depth analysis of the cross-attention layer. Extensive experiments show that our method substantially improves over state-of-the-art text-to-image models in synthesizing high-quality human images from user-written prompts. Project page: \url{https://hcplayercvpr2024.github.io}.