Diffusion transformers have attracted substantial interest in diffusion generative modeling due to their outstanding performance. However, their high computational cost, arising from the quadratic complexity of attention and multi-step inference, presents a significant bottleneck. To address this challenge, we propose TokenCache, a novel post-training acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computation among tokens across inference steps. TokenCache specifically addresses three critical questions for diffusion transformers: (1) which tokens should be pruned to eliminate redundancy, (2) which blocks should be targeted for efficient pruning, and (3) at which time steps caching should be applied to balance speed and quality. To answer these questions, TokenCache introduces a Cache Predictor that assigns importance scores to tokens, enabling selective pruning without compromising model performance. Furthermore, we propose an adaptive block selection strategy that focuses on blocks with minimal impact on the network's output, along with a Two-Phase Round-Robin (TPRR) scheduling policy that optimizes caching intervals throughout the denoising process. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers. Our code will be made publicly available.
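To make the two caching decisions concrete, here is a minimal sketch of (a) importance-scored token selection and (b) a two-phase round-robin recomputation schedule. The function names, the fixed cache ratio, and the specific phase rule (denser recomputation early in denoising) are illustrative assumptions, not the paper's exact formulation; in TokenCache the importance scores come from the learned Cache Predictor rather than being given directly.

```python
def select_tokens_to_cache(importance, cache_ratio):
    """Decide, per token, whether to reuse its cached block output.

    importance: per-token scores (here assumed given; the paper learns
                them with a Cache Predictor).
    Returns a boolean mask: True = reuse cached output, False = recompute.
    """
    num_cached = int(len(importance) * cache_ratio)
    # Tokens with the lowest importance are assumed to change least
    # between adjacent denoising steps, so their outputs are reused.
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    mask = [False] * len(importance)
    for i in order[:num_cached]:
        mask[i] = True
    return mask


def tprr_schedule(num_steps, early_interval=2, late_interval=4, split=0.5):
    """Two-phase round-robin (illustrative): full recomputation every
    `interval` steps, with a shorter interval in the first phase of
    denoising. Returns True for steps that recompute everything.
    """
    boundary = int(num_steps * split)
    return [
        t % (early_interval if t < boundary else late_interval) == 0
        for t in range(num_steps)
    ]
```

For example, with four tokens and a 50% cache ratio, the two lowest-scoring tokens are served from cache while the rest are recomputed; the schedule then dictates at which steps the cache is refreshed for all tokens.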