Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
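The abstract does not spell out the exact form of the regularization loss, but the core idea — reading the attention map as a soft flow field and penalizing attention that lands away from the correct reference region — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`attention_to_flow`, `flow_regularization_loss`) and the choice of a squared-error penalty against ground-truth correspondences are assumptions made for clarity.

```python
import numpy as np

def attention_to_flow(attn, ref_coords):
    """Read a row-stochastic attention map as a flow field.

    attn:       (Q, K) attention weights, each row summing to 1.
    ref_coords: (K, 2) spatial coordinates of the reference keys.
    Returns the expected reference coordinate per target query (a
    soft-argmax), shape (Q, 2).
    """
    return attn @ ref_coords

def flow_regularization_loss(attn, ref_coords, gt_coords):
    """Hypothetical Leffa-style regularizer (illustrative form only).

    Penalizes the squared deviation of the attention-implied flow from
    ground-truth correspondences gt_coords (Q, 2), pushing each target
    query to attend to its correct reference region.
    """
    flow = attention_to_flow(attn, ref_coords)
    return float(np.mean(np.sum((flow - gt_coords) ** 2, axis=-1)))
```

With a perfectly aligned attention map (each query attending only to its true key) the loss is zero; diffuse attention that averages over unrelated reference locations is penalized, which matches the abstract's diagnosis that detail distortion stems from attending to the wrong reference regions.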