Image editing aims to modify a given synthetic or real image to meet specific user requirements. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generated Content (AIGC). Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail the various control signals and editing scenarios. We then propose a unified framework that formalizes the editing process and categorizes it into two primary algorithm families. This framework offers a design space in which users can achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Because training-based methods learn to map the source image directly to the target image under user guidance, we discuss them separately and introduce the schemes used to inject the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We maintain a list of related works at https://github.com/xinchengshuai/Awesome-Image-Editing.