Multi-turn image editing is essential for iterative design, yet current models often suffer from identity drift and error accumulation over successive editing steps. Existing research leverages video priors for consistency, but its reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-horizon multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism that anchors the initial subject identity and ensures stable extrapolation across extended editing trajectories. To evaluate performance, we introduce a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining strong subject fidelity and instruction following even over more than 10 interaction rounds.
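
The contrast between bidirectional and causal attention, and the idea of anchoring the initial subject, can be illustrated at the level of editing turns. The sketch below is not the AnchorEdit implementation: the abstract does not specify how the memory mechanism works, so the sliding window over recent turns, the always-visible first turn, and the names `turn_attention_mask`, `window`, and `anchor_first` are all assumptions made for illustration.

```python
import numpy as np


def turn_attention_mask(num_turns, window=None, anchor_first=False, causal=True):
    """Return a boolean [num_turns, num_turns] mask.

    mask[i, j] is True when editing turn i may attend to turn j.
    Turn 0 stands for the initial image (hypothetical convention).
    """
    i = np.arange(num_turns)[:, None]  # query turn index (rows)
    j = np.arange(num_turns)[None, :]  # key turn index (columns)

    if not causal:
        # Bidirectional attention: every turn sees every other turn,
        # including turns that do not exist yet in an interactive session.
        return np.ones((num_turns, num_turns), dtype=bool)

    # Causal attention: a turn only sees itself and earlier turns.
    mask = j <= i

    if window is not None:
        # Assumed memory policy: keep only the most recent `window` turns...
        recent = (i - j) < window
        if anchor_first:
            # ...plus the initial turn, which stays visible as an identity anchor.
            recent = recent | (j == 0)
        mask = mask & recent

    return mask


if __name__ == "__main__":
    # Turn 5 of a 6-turn session: sees the anchor (turn 0) and turns 4 and 5.
    print(turn_attention_mask(6, window=2, anchor_first=True)[5])
```

Under these assumptions, dropping `anchor_first` leaves late turns with no direct view of the original image, which is one plausible way identity drift could accumulate over a long editing trajectory.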