Diffusion Transformers (DiTs) have exhibited strong capabilities in image generation tasks. However, accurate text-guided image editing with multimodal DiTs (MM-DiTs) remains a significant challenge. Unlike UNet-based architectures, which can leverage self-/cross-attention maps for semantic editing, MM-DiTs lack an explicit and consistent mechanism for incorporating text guidance, resulting in semantic misalignment between the edited results and the guiding text. In this study, we reveal that different attention heads within MM-DiTs are sensitive to different image semantics, and introduce HeadRouter, a training-free image editing framework that edits a source image by adaptively routing text guidance to different attention heads in the MM-DiT. Furthermore, we present a dual-token refinement module that refines text and image token representations for precise semantic guidance and accurate region expression. Experimental results on multiple benchmarks demonstrate HeadRouter's strong performance in terms of editing fidelity and image quality.
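The abstract does not specify how the routing is computed, so the following is only a minimal PyTorch sketch of one plausible reading: score each attention head by how much mass it places on the edit-text tokens, then gate the text-token attention per head. The function names `head_sensitivity` and `route_text_guidance`, the softmax gating, and the temperature are all illustrative assumptions, not the paper's implementation.

```python
import torch

def head_sensitivity(attn_probs: torch.Tensor, text_slice: slice) -> torch.Tensor:
    """Score how strongly each head attends to the edit-text tokens.

    attn_probs: (batch, heads, query_tokens, key_tokens) joint-attention weights.
    text_slice: positions of the text tokens along the key axis.
    Returns a (batch, heads) sensitivity score.
    """
    # Attention mass each head places on the text tokens, averaged over queries.
    return attn_probs[..., text_slice].sum(dim=-1).mean(dim=-1)

def route_text_guidance(attn_probs: torch.Tensor, text_slice: slice,
                        temperature: float = 0.1) -> torch.Tensor:
    """Reweight text-token attention per head by its sensitivity (a sketch).

    Heads that already respond to the edited semantic receive amplified text
    guidance; insensitive heads are damped, leaving unrelated content intact.
    """
    scores = head_sensitivity(attn_probs, text_slice)        # (B, H)
    gates = torch.softmax(scores / temperature, dim=-1)      # per-head route weights
    gates = gates * gates.shape[-1]                          # keep mean gate at 1
    routed = attn_probs.clone()
    routed[..., text_slice] *= gates[..., None, None]        # gate text attention
    return routed / routed.sum(dim=-1, keepdim=True)         # renormalize over keys
```

A routed map like this would replace the raw attention weights inside each MM-DiT block during the editing pass; the dual-token refinement module described in the abstract is a separate component and is not sketched here.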