Recent breakthroughs in transformer-based diffusion models, particularly models driven by the Multimodal Diffusion Transformer (MMDiT) such as FLUX and Qwen Image, have enabled compelling text-to-image generation and editing. To understand the internal mechanisms of MMDiT-based models, existing methods have analyzed the effect of specific components such as positional encoding and attention layers. Yet a comprehensive understanding of how individual blocks, and their interactions with textual conditions, contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline that investigates each block's functionality by removing blocks and by disabling or enhancing the textual hidden states at those blocks. Our analysis reveals that 1) semantic information emerges in earlier blocks while finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling their text conditions, and 3) enhancing textual conditions in selected blocks improves semantic attributes. Building on these observations, we propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrate that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. On SD3.5, our method improves T2I-CompBench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% without sacrificing synthesis quality. These results advance the understanding of MMDiT models and provide valuable insights that unlock new possibilities for further improvement.
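To make the three block-level probes concrete, the sketch below illustrates how removing a block, disabling its text condition, and enhancing its textual hidden states could be implemented. It is a minimal toy sketch, not the authors' actual code: `ToyMMDiTBlock`, `run_with_interventions`, and all shapes and hyperparameters are hypothetical stand-ins for a real MMDiT backbone, assuming only that each block jointly attends over concatenated image and text tokens.

```python
import torch
import torch.nn as nn

class ToyMMDiTBlock(nn.Module):
    """Hypothetical stand-in for one MMDiT block: joint attention over
    concatenated image and text tokens, followed by a feed-forward layer."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        x = torch.cat([img, txt], dim=1)               # joint token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.norm2(x))
        n_img = img.shape[1]
        return x[:, :n_img], x[:, n_img:]              # split streams back

def run_with_interventions(blocks, img, txt,
                           skip=(), disable_text=(), text_scale=None):
    """Toy version of the paper's three probes (names are assumptions):
    - skip:         block indices removed entirely (identity mapping)
    - disable_text: block indices where textual hidden states are zeroed
    - text_scale:   {block_idx: gamma}; gamma > 1 amplifies the textual
                    hidden states, i.e. 'enhancing' the text condition
    """
    text_scale = text_scale or {}
    for i, blk in enumerate(blocks):
        if i in skip:
            continue                                   # remove the block
        if i in disable_text:
            txt = torch.zeros_like(txt)                # disable text condition
        if i in text_scale:
            txt = text_scale[i] * txt                  # enhance text condition
        img, txt = blk(img, txt)
    return img

# Usage: 8 toy blocks; enhance text at early blocks, remove one late block.
blocks = nn.ModuleList(ToyMMDiTBlock(64) for _ in range(8))
img = torch.randn(1, 16, 64)   # 16 image tokens
txt = torch.randn(1, 8, 64)    # 8 text tokens
out = run_with_interventions(blocks, img, txt,
                             skip={6}, text_scale={1: 1.5, 2: 1.5})
print(out.shape)               # torch.Size([1, 16, 64])
```

Under this sketch, the reported findings map directly onto the knobs: zeroing `txt` at a block tests how much that block depends on the text condition, while `gamma > 1` at early blocks corresponds to enhancing semantic attributes.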