Denoising-based diffusion transformers achieve strong generation performance but converge slowly during training. Existing remedies such as REPA (which relies on an external representation encoder) or SRA (which requires a dual-model setup) inevitably incur heavy computational overhead during training because of these external dependencies. To tackle this challenge, this paper proposes \textbf{\namex}, a lightweight intrinsic guidance framework for efficient diffusion training. \name leverages the features of an off-the-shelf pre-trained Variational Autoencoder (VAE): because the VAE is trained for reconstruction, its features inherently encode visual priors such as rich texture details, structural patterns, and basic semantic information. Specifically, \name aligns the intermediate latent features of the diffusion transformer with the VAE features through a lightweight projection layer supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, yielding a simple yet effective pipeline. Extensive experiments demonstrate that \name improves both generation quality and training convergence speed over vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and adds merely 4\% extra GFLOPs with zero additional cost for external guidance models.
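As a rough illustration of the alignment mechanism described above, the following is a minimal NumPy sketch, not the paper's implementation: it assumes the projection layer is a single linear map and that the alignment loss is a negative mean cosine similarity between projected diffusion-transformer features and frozen VAE features (the function names `alignment_loss` and all dimensions are hypothetical).

```python
import numpy as np

def alignment_loss(h_dit, z_vae, W, b):
    """Hypothetical sketch of a VAE feature-alignment loss: project
    intermediate diffusion-transformer features through a lightweight
    linear layer, then penalize negative cosine similarity with the
    frozen VAE features. The exact loss used in the paper may differ."""
    proj = h_dit @ W + b  # (N, d_vae): projected intermediate features
    cos = np.sum(proj * z_vae, axis=-1) / (
        np.linalg.norm(proj, axis=-1) * np.linalg.norm(z_vae, axis=-1) + 1e-8
    )
    return float(-np.mean(cos))  # minimized when features align

# Toy usage: 4 tokens, transformer width 8, VAE feature dim 4.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))   # intermediate DiT features
z = rng.standard_normal((4, 4))   # frozen VAE features
W = rng.standard_normal((8, 4)) * 0.1
b = np.zeros(4)
loss = alignment_loss(h, z, W, b)
```

Only `W` and `b` are trained by this auxiliary loss (alongside the backbone), which is consistent with the claimed small overhead: a single projection layer rather than an external guidance model.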