Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow