Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the text-branch representations across layers. Motivated by these findings, we introduce prompt reinjection, a training-free approach that reinjects prompt representations from early layers into later layers to counteract this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.
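To make the mechanism concrete, the sketch below illustrates one plausible, training-free form of prompt reinjection under stated assumptions: text-branch hidden states are cached at an early joint block and blended back into the text stream at later blocks. The block structure (`ToyJointBlock`), the blending rule, and the hyperparameters `cache_layer`, `reinject_from`, and `alpha` are illustrative assumptions, not the paper's actual implementation or the SD3/FLUX.1 APIs.

```python
import torch
import torch.nn as nn

class ToyJointBlock(nn.Module):
    """Stand-in for one MMDiT joint block: text and image tokens attend jointly."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # Joint self-attention over the concatenated text and image token streams.
        x = torch.cat([txt, img], dim=1)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        x = x + out
        n_txt = txt.shape[1]
        return x[:, :n_txt], x[:, n_txt:]

def forward_with_reinjection(blocks, txt, img,
                             cache_layer=2, reinject_from=8, alpha=0.3):
    """Training-free prompt reinjection (sketch, hypothetical hyperparameters):
    cache the text-branch hidden states at an early layer and blend them back
    into the text stream at every block from `reinject_from` onward."""
    cached = None
    for i, blk in enumerate(blocks):
        if cached is not None and i >= reinject_from:
            # Blend the early prompt representation back in to counteract
            # the progressive forgetting in deeper layers.
            txt = (1 - alpha) * txt + alpha * cached
        txt, img = blk(txt, img)
        if i == cache_layer:
            cached = txt.detach()
    return txt, img

# Minimal usage on random tensors standing in for prompt tokens and latents.
dim = 64
blocks = nn.ModuleList([ToyJointBlock(dim) for _ in range(12)])
txt = torch.randn(1, 16, dim)    # prompt token embeddings
img = torch.randn(1, 256, dim)   # image latent tokens
txt_out, img_out = forward_with_reinjection(blocks, txt, img)
print(txt_out.shape, img_out.shape)
```

Because the method only reroutes existing activations, a convex blend as above leaves the forward pass differentiable and adds no parameters, which is what makes such an intervention applicable to pretrained MMDiTs without any fine-tuning.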