The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools demand significant expertise, limiting accessibility. Recent strides in instruction-based editing enable intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. Synthesizing over 100 publications, we trace methods from generative adversarial networks to diffusion models and examine how multimodal integration enables fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting gains in accessibility and alignment with human intuition. We compare the existing literature with an emphasis on LLM-empowered editing and identify key challenges to stimulate further research. Our goal is to help democratize powerful visual editing across industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.