Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical deployment suffers from inherent dynamic feature instability, which amplifies errors during cached inference. Through systematic analysis, we identify the absence of long-range feature preservation mechanisms as the root cause of unstable feature propagation and perturbation sensitivity. To address this, we propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs), the key efficiency component of U-Nets. Theoretical spectral-norm analysis and feature visualizations demonstrate how LSCs stabilize feature dynamics. The Skip-DiT architecture and its stabilized dynamic features enable an efficient static caching mechanism that reuses deep features across timesteps while updating only shallow components. Extensive experiments on image and video generation tasks demonstrate that Skip-DiT achieves (1) 4.4× training acceleration and faster convergence, and (2) 1.5-2× inference acceleration without quality loss and with high fidelity to the original output, outperforming existing DiT caching methods across various quantitative metrics. Our findings establish long-skip connections as critical architectural components for training stable and efficient diffusion transformers.
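To make the caching mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract summarizes: shallow components are recomputed at every denoising step, while the output of the deep blocks is cached and reused, with a long skip connection fusing the two. This is an illustrative assumption, not the paper's implementation: the names (`SkipDiTSketch`, `Block`), the single shallow block, the concat-then-linear fusion, and the every-other-step cache schedule are all hypothetical choices.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Stand-in for one DiT transformer block (attention omitted for brevity)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(self.norm(x))


class SkipDiTSketch(nn.Module):
    """Toy DiT backbone with a Long-Skip-Connection and a static deep-feature cache."""
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.shallow = Block(dim)                                   # always recomputed
        self.deep = nn.ModuleList(Block(dim) for _ in range(depth - 2))
        self.fuse = nn.Linear(2 * dim, dim)                         # long-skip fusion
        self.final = Block(dim)
        self._cache = None                                          # cached deep feature

    def forward(self, x, reuse_deep=False):
        s = self.shallow(x)                                         # fresh shallow feature
        if reuse_deep and self._cache is not None:
            h = self._cache                                         # reuse cached deep feature
        else:
            h = s
            for blk in self.deep:                                   # full deep pass
                h = blk(h)
            self._cache = h.detach()                                # refresh the cache
        # Long-Skip-Connection: fuse the fresh shallow feature with the
        # (possibly cached) deep feature before the final block.
        return self.final(self.fuse(torch.cat([s, h], dim=-1)))
```

A usage sketch under the same assumptions: with an alternating schedule, every other step skips the deep blocks entirely, so their cost is roughly halved while the shallow path still tracks the current timestep.

```python
model = SkipDiTSketch().eval()
x = torch.randn(2, 16, 256)  # (batch, tokens, dim)
with torch.no_grad():
    for t in range(10):
        x = model(x, reuse_deep=(t % 2 == 1))  # hypothetical cache schedule
```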