Adding objects to images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for the added object in complex scenes. We introduce Add-it, a training-free approach that extends the attention mechanisms of diffusion models to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements across various automated metrics.
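To make the "weighted extended-attention" idea concrete, the following is a minimal, hypothetical sketch: the generated image's queries attend over keys and values concatenated from several sources (scene image, text prompt, generated image), with a per-source scalar weight folded into the attention logits as a log-bias before renormalization. All function and variable names here are illustrative, and this is not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_extended_attention(q, sources, weights):
    """
    q: (n_q, d) queries from the generated image tokens.
    sources: list of (K, V) pairs, one per information source
             (e.g. scene image, text prompt, generated image itself).
    weights: per-source positive scalars; scaling a source's attention
             probabilities by w and renormalizing is equivalent to
             adding log(w) to that source's logits.
    """
    d = q.shape[-1]
    ks = np.concatenate([k for k, _ in sources], axis=0)
    vs = np.concatenate([v for _, v in sources], axis=0)
    # one log-weight per key token, grouped by source
    log_w = np.concatenate(
        [np.full(k.shape[0], np.log(w)) for (k, _), w in zip(sources, weights)]
    )
    logits = (q @ ks.T) / np.sqrt(d) + log_w
    return softmax(logits, axis=-1) @ vs
```

With all weights set to 1, this reduces to plain attention over the concatenated key/value sets; raising the scene image's weight biases the output toward preserving scene content, while raising the prompt's weight biases it toward the instructed object.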