Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily by relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as little as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.
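The cache-then-mask-then-reassemble pipeline described above can be sketched conceptually as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`build_cache`, `adaptive_mask`, `reassemble`), the use of total-variation distance between token distributions, and the fixed threshold are all assumptions made for exposition.

```python
import numpy as np

def build_cache(token_indices, token_probs):
    # Hypothetical cache from the source generation: per-position token
    # indices and probability distributions over the visual codebook.
    return {"indices": token_indices, "probs": token_probs}

def adaptive_mask(cache, target_probs, threshold=0.1):
    # Mark a token position as editable when the target prompt's
    # distribution diverges from the cached source distribution.
    # Total-variation distance is one plausible divergence measure.
    tv = 0.5 * np.abs(cache["probs"] - target_probs).sum(axis=-1)
    return tv > threshold  # True = region relevant to the edit

def reassemble(cache, target_indices, mask):
    # Keep cached source tokens outside the mask (preserving fidelity);
    # take the target prompt's tokens inside it (applying the edit).
    return np.where(mask, target_indices, cache["indices"])
```

For example, a position where source and target distributions agree keeps its cached token, while a strongly diverging position receives the target token, which is how unintended global changes are suppressed in this sketch.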