This paper addresses image editing with generative models. Flow Matching is an emerging generative modeling technique that is simple and efficient to train. In parallel, the transformer-based U-ViT has recently been proposed as a replacement for the commonly used UNet, offering better scalability and performance in generative modeling. Flow Matching with a transformer backbone therefore has the potential for scalable, high-quality generative modeling, but its latent structure and editing ability are not yet understood. We adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution that enables sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, and it edits images effectively while preserving the essence of the original content. Our code will be publicly available at https://taohu.me/lfm/
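To make the sampling setting concrete: a Flow Matching model generates a sample by integrating an ODE $dx/dt = v(x, t)$ from noise at $t=0$ to data at $t=1$, and an adaptive step-size solver chooses each step from a local error estimate instead of a fixed grid. The following is a minimal sketch of that idea only, not the tailored sampling solution proposed in the paper. The function names, the tolerances, and the toy velocity field (a stand-in for a learned network, chosen because its solution is known in closed form) are all assumptions made for illustration.

```python
import math


def velocity(x, t, target):
    """Toy velocity field (assumption): pulls x toward `target`.

    A trained Flow Matching network v_theta(x, t) would take its place.
    """
    return [tg - xi for xi, tg in zip(x, target)]


def sample_adaptive(x0, target, t0=0.0, t1=1.0, rtol=1e-5, atol=1e-7, h=0.1):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with adaptive steps.

    Uses the embedded Heun-Euler pair: the gap between the first-order
    (Euler) and second-order (Heun) updates serves as the error estimate.
    Returns the final state and the number of accepted steps.
    """
    x, t, steps = list(x0), t0, 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        k1 = velocity(x, t, target)
        x_euler = [xi + h * a for xi, a in zip(x, k1)]
        k2 = velocity(x_euler, t + h, target)
        x_heun = [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]
        # Scaled error: a value <= 1 means the step meets the tolerances.
        err = max(
            abs(hi - ei) / (atol + rtol * max(abs(xi), abs(hi)))
            for xi, hi, ei in zip(x, x_heun, x_euler)
        )
        if err <= 1.0:
            x, t, steps = x_heun, t + h, steps + 1
        # Shrink the step after a rejection, grow it when the error is small.
        h *= min(5.0, max(0.2, 0.9 / math.sqrt(max(err, 1e-12))))
    return x, steps


# Usage: compare against the closed-form solution of the toy ODE,
# x(1) = target + (x0 - target) * exp(-1).
x0, target = [1.0, -2.0], [0.5, 0.5]
x1, n_steps = sample_adaptive(x0, target)
exact = [tg + (a - tg) * math.exp(-1.0) for a, tg in zip(x0, target)]
max_abs_error = max(abs(a - b) for a, b in zip(x1, exact))
```

The solver takes small steps where the velocity changes quickly and larger ones elsewhere, which is the efficiency argument for adaptive solvers over a fixed step count.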