Diffusion transformers have attracted substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and the multi-step inference process, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? To answer these questions, TokenCache introduces a Cache Predictor that hierarchically addresses them through (1) token pruning: assigning an importance score to each token to determine which tokens to prune and reuse; (2) block selection: allocating a pruning ratio to each block to adaptively select blocks for caching; (3) temporal scheduling: deciding at which time steps to apply caching strategies. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.
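The token-pruning idea described above can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified version of the caching step (the function name `token_cache_step`, the toy `block_fn`, and the score tensors are illustrative assumptions, not the paper's actual implementation): given per-token importance scores and a per-block pruning ratio, only the highest-scoring tokens are recomputed by the block, while the remaining tokens reuse their cached outputs from a previous time step.

```python
import numpy as np

def token_cache_step(tokens, cached_out, importance, prune_ratio, block_fn):
    """Hypothetical sketch of a TokenCache-style update:
    recompute only the most important tokens through the block,
    and reuse cached outputs for the pruned tokens."""
    n = tokens.shape[0]
    # Number of tokens to actually recompute this step.
    n_keep = max(1, int(round(n * (1 - prune_ratio))))
    # Indices of the highest-importance tokens.
    keep_idx = np.argsort(importance)[::-1][:n_keep]
    # Start from the cached outputs, then refresh only the kept tokens.
    out = cached_out.copy()
    out[keep_idx] = block_fn(tokens[keep_idx])
    return out

# Toy usage: a "block" that doubles token features.
tokens = np.arange(8, dtype=float).reshape(4, 2)   # 4 tokens, dim 2
cached = np.zeros_like(tokens)                     # stale cached outputs
importance = np.array([0.9, 0.1, 0.5, 0.2])        # assumed predictor scores
out = token_cache_step(tokens, cached, importance,
                       prune_ratio=0.5, block_fn=lambda x: 2 * x)
```

In a full model, this selection would be driven by the Cache Predictor's learned scores and applied only at the time steps chosen by the temporal schedule; here the scores, ratio, and schedule are fixed by hand for clarity.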