A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

Image editing aims to modify a given synthetic or real image to meet users' specific requirements. It has been widely studied in recent years as a promising and challenging field of Artificial Intelligence Generated Content (AIGC). Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Because training-based methods learn to directly map the source image to the target one under user guidance, we discuss them separately and introduce injection schemes for the source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep track of related work at https://github.com/xinchengshuai/Awesome-Image-Editing.

