Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily by relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as little as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.
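The cache-then-mask-then-reassemble pipeline described above can be sketched conceptually as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`build_cache`, `adaptive_mask`, `reassemble`), the use of total-variation distance between token distributions, and the fixed threshold are all assumptions made for exposition.

```python
import numpy as np

def build_cache(token_indices, token_probs):
    # Hypothetical cache from the source generation: per-position token
    # indices and probability distributions over the visual codebook.
    return {"indices": token_indices, "probs": token_probs}

def adaptive_mask(cache, target_probs, threshold=0.1):
    # Mark a token position as editable when the target prompt's
    # distribution diverges from the cached source distribution.
    # Total-variation distance is one plausible divergence measure.
    tv = 0.5 * np.abs(cache["probs"] - target_probs).sum(axis=-1)
    return tv > threshold  # True = region relevant to the edit

def reassemble(cache, target_indices, mask):
    # Keep cached source tokens outside the mask (preserving fidelity);
    # take the target prompt's tokens inside it (applying the edit).
    return np.where(mask, target_indices, cache["indices"])
```

For example, a position where source and target distributions agree keeps its cached token, while a strongly diverging position receives the target token, which is how unintended global changes are suppressed in this sketch.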