Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, in which token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(\delta_k, \delta_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
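The dual-channel mechanism described above can be illustrated with a minimal numerical sketch. This is not the authors' implementation: it assumes a simple formulation in which each Key/Value token is decomposed into a layer bias plus a residual, and the scalars $\delta_k$ and $\delta_v$ rescale those residuals before standard scaled dot-product attention. The function name `dcag_attention` and the exact interpolation form are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dcag_attention(Q, K, V, b_k, b_v, delta_k=1.0, delta_v=1.0):
    """Hypothetical bias-delta attention guidance (sketch, not the paper's code).

    Q: (n_q, d) queries; K, V: (n_kv, d) keys/values.
    b_k, b_v: (d,) layer-specific bias vectors around which tokens cluster.
    delta_k rescales Key residuals (coarse control, acts through softmax);
    delta_v rescales Value residuals (fine control, acts linearly).
    """
    K_mod = b_k + delta_k * (K - b_k)  # where to attend: modulates routing
    V_mod = b_v + delta_v * (V - b_v)  # what to aggregate: modulates features
    d = Q.shape[-1]
    scores = Q @ K_mod.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_mod
```

With `delta_k = delta_v = 1.0` this reduces to ordinary attention; pushing `delta_v` toward 0 collapses every aggregated feature toward the Value bias `b_v`, which makes the linear, fine-grained nature of the Value channel easy to see.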