The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting their accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. Synthesizing over 100 publications, we explore methods ranging from generative adversarial networks to diffusion models, and examine multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting gains in accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across industries from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.