Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing for existing generation models. However, current approaches struggle to deliver strong edits while preserving consistency with the source. This limitation is particularly critical in multi-round and video editing, where visual errors accumulate over time. Moreover, most existing methods enforce global consistency, which prevents them from modifying individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating the text and vision modalities. These advances pave the way for overcoming challenges that previous methods could not resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method tailored for MM-DiT. ConsistEdit combines vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, covering both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without handcrafted tuning, which significantly enhances reliability and consistency and enables robust multi-round and multi-region editing. Furthermore, ConsistEdit supports progressive adjustment of structural consistency, enabling finer-grained control.
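To make the three mechanisms named above concrete, the following is a minimal PyTorch sketch of the edit path through one MM-DiT joint-attention layer. It is an illustration under stated assumptions, not the authors' implementation: the tensor layout, the text-first token ordering, the source of `mask`, and the `structure_weight` knob are all hypothetical.

```python
# Minimal sketch (assumptions, not the authors' code) of the three ideas
# in the abstract: vision-only attention control, mask-guided
# pre-attention fusion, and differentiated Q/K/V manipulation in one
# MM-DiT joint-attention layer.
import torch
import torch.nn.functional as F

def consistedit_attention(q_src, k_src, v_src,      # cached source-branch projections
                          q_edit, k_edit, v_edit,   # edit-branch projections
                          n_text, mask, structure_weight=1.0):
    """All tensors: (batch, heads, tokens, dim); the first `n_text` tokens
    are text tokens. `mask` is (n_vision,), 1 inside the edit region."""
    q, k, v = q_edit.clone(), k_edit.clone(), v_edit.clone()
    vis = slice(n_text, None)       # vision-only control: text tokens stay
    m = mask[None, None, :, None]   # untouched so the new prompt still steers.

    # Differentiated Q/K manipulation: pulling the edit branch's vision
    # queries/keys toward the source's enforces structural consistency;
    # lowering structure_weight relaxes it progressively.
    q[..., vis, :] = structure_weight * q_src[..., vis, :] + (1 - structure_weight) * q[..., vis, :]
    k[..., vis, :] = structure_weight * k_src[..., vis, :] + (1 - structure_weight) * k[..., vis, :]

    # Mask-guided pre-attention fusion on V: edited appearance inside the
    # masked region, source appearance outside it, fused before attention.
    v[..., vis, :] = m * v[..., vis, :] + (1 - m) * v_src[..., vis, :]

    # Standard joint attention over the manipulated tokens.
    return F.scaled_dot_product_attention(q, k, v)
```

Because this style of control only blends attention projections, it can in principle be applied at every inference step and attention layer, which is consistent with the paper's claim of step- and layer-complete editing without handcrafted tuning.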