EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we look beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to this challenge. Because MGTs predict multiple masked tokens rather than refining the whole image at once, they exhibit a localized decoding paradigm that gives them an inherent capacity to explicitly preserve edit-irrelevant regions during editing. Building on this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that the MGT's cross-attention maps provide informative signals for localizing edit-relevant regions, and we devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained, precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping in low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similar performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
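
The two mechanisms named above, multi-layer attention consolidation and region-hold sampling, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how layers are combined or how low-attention areas are selected, so the mean-over-layers consolidation, the min-max normalization, the fixed `threshold`, the hard hold on low-attention tokens, and the function names `consolidate_attention` and `region_hold` are all assumptions made for illustration.

```python
import numpy as np


def consolidate_attention(attn_maps):
    """Combine per-layer cross-attention maps into one map in [0, 1].

    Assumption: layers are averaged and then min-max normalised; the
    paper's actual consolidation scheme may differ.
    """
    stacked = np.stack(attn_maps, axis=0)  # shape: (layers, H, W)
    mean_map = stacked.mean(axis=0)
    lo, hi = mean_map.min(), mean_map.max()
    if hi - lo < 1e-12:
        # A flat map carries no localization signal.
        return np.zeros_like(mean_map)
    return (mean_map - lo) / (hi - lo)


def region_hold(source_tokens, proposed_tokens, attn, threshold=0.5):
    """Accept proposed tokens only where attention is high.

    Assumption: tokens whose consolidated attention falls below the
    hypothetical `threshold` are held at their source values, so they
    cannot flip; all other tokens take the proposed value.
    """
    editable = attn >= threshold
    edited = np.where(editable, proposed_tokens, source_tokens)
    return edited, editable


if __name__ == "__main__":
    # Toy 4x4 token grid: two layers both attend to the top-left 2x2 block.
    layer_a = np.zeros((4, 4))
    layer_a[:2, :2] = 1.0
    layer_b = np.zeros((4, 4))
    layer_b[:2, :2] = 0.8
    layer_b[3, 3] = 0.2  # weak, spurious response outside the target

    attn = consolidate_attention([layer_a, layer_b])
    source = np.zeros((4, 4), dtype=int)
    proposed = np.full((4, 4), 7, dtype=int)
    edited, editable = region_hold(source, proposed, attn)
    print(edited)
```

In this toy setting only the top-left block is allowed to change, and the weak response at the bottom-right corner is suppressed because it falls below the threshold after consolidation.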