PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Image composition involves seamlessly integrating given objects into a specific visual context. Current training-free methods compose attention weights from several samplers to guide the generator. However, because these weights are derived from disparate contexts, combining them causes coherence confusion in the synthesis and a loss of appearance information. These issues are aggravated by the methods' excessive focus on background generation, even when it is unnecessary in this task. This focus not only slows inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task that focuses solely on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites images through well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, using its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and the background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. In addition, we introduce Region-constrained Cross-Attention to confine the impact of specific subject-related words to the desired regions, which addresses the unwanted artifacts seen in the prior method and further improves coherence in the transition area. Our method exhibits the fastest inference, and extensive experiments demonstrate its superiority both qualitatively and quantitatively.
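As a rough illustration of two ideas in the abstract, the sketch below shows (a) combining the edited foreground with the noisy background at a single step and (b) one possible way to confine subject-related words to a region in cross-attention. It is a dependency-free toy on flat lists, not the paper's implementation. The function names (`blend_step`, `region_constrained_cross_attention`), the binary mask, and the choice to mask subject-token logits with negative infinity outside the region are all assumptions made for illustration.

```python
import math


def blend_step(fg_latent, bg_latent, mask):
    """Combine the edited foreground with the noisy background at one step.

    All arguments are flat lists of equal length. mask is 1 inside the
    subject region and 0 outside (an assumed representation).
    """
    return [m * f + (1 - m) * b for f, b, m in zip(fg_latent, bg_latent, mask)]


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def region_constrained_cross_attention(scores, mask, subject_tokens):
    """Zero out subject-word attention for pixels outside the region.

    scores[p][t] is the raw logit of pixel p for text token t.
    subject_tokens is a set of token indices tied to the subject.
    Assumption: every pixel keeps at least one non-subject token, so each
    row still has a finite logit after masking.
    """
    weights = []
    for pixel_scores, m in zip(scores, mask):
        row = [
            -math.inf if (t in subject_tokens and m == 0) else s
            for t, s in enumerate(pixel_scores)
        ]
        weights.append(softmax(row))
    return weights
```

In this toy setup, a pixel outside the mask assigns zero weight to every token index in `subject_tokens`, while pixels inside the mask attend to all tokens as usual. A real pipeline would apply the same logic to tensors inside the denoising network's attention layers.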
