Diffusion Transformers (DiTs) have advanced visual synthesis considerably, largely because they scale well. To help DiTs learn meaningful internal representations, recent methods such as REPA align them with features from external pretrained encoders. However, the mechanisms governing representation learning within DiTs remain poorly understood. We therefore first systematically investigate the representation dynamics of DiTs. By analyzing how internal representations evolve, and what influence they have, under various settings, we find that representation diversity across blocks is a crucial factor for effective learning. Based on this insight, we propose DiverseDiT, a framework that explicitly promotes representation diversity. DiverseDiT combines long residual connections, which diversify the input representations across blocks, with a representation diversity loss, which encourages blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 show that DiverseDiT yields consistent performance gains and faster convergence when applied to different backbones of various sizes, including in the challenging one-step generation setting. Furthermore, DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work offers insight into the representation learning dynamics of DiTs and a practical approach for improving their performance.
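The two components named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the diversity term is taken to be the mean squared cosine similarity between pooled per-block features, and the long residual connection is taken to be a weighted sum of an earlier block's output with a later block's input. The function names `diversity_penalty` and `long_residual` and the `weight` parameter are hypothetical, not the authors' API.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_penalty(block_features):
    """Mean squared cosine similarity over all pairs of block features.

    ASSUMPTION: this is one plausible way to penalize similar representations
    across blocks; it is not taken from the paper. `block_features` is a list
    with one pooled feature vector per transformer block. Lower values mean
    the blocks' features are more distinct.
    """
    n = len(block_features)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(block_features[i], block_features[j]) ** 2
            pairs += 1
    return total / pairs


def long_residual(early_output, late_input, weight=1.0):
    """Add an earlier block's output to a later block's input.

    ASSUMPTION: a simple weighted sum stands in for the long residual
    connection; the actual wiring used by DiverseDiT may differ.
    """
    return [x + weight * e for e, x in zip(early_output, late_input)]


if __name__ == "__main__":
    identical = [[1.0, 0.0], [1.0, 0.0]]
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    print(diversity_penalty(identical))   # blocks with the same feature direction
    print(diversity_penalty(orthogonal))  # blocks with orthogonal features
    print(long_residual([1.0, 2.0], [3.0, 4.0]))
```

Under these assumptions, the penalty is 1.0 when every block produces the same feature direction and 0.0 when the block features are mutually orthogonal, so adding it to the training objective would push blocks toward distinct features.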