Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative denoising structure and deep transformer stacks. To alleviate this inefficiency, we propose \textbf{FastCache}, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden-state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps whenever their change falls below a predefined threshold. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through a learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation quality among existing cache methods as measured by FID and t-FID. To further improve FastCache's speedup, we also introduce a token-merging module that merges redundant tokens based on k-nearest-neighbor (k-NN) density. Code is available at \href{https://github.com/NoakLiu/FastCache-xDiT}{https://github.com/NoakLiu/FastCache-xDiT}.
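To make the timestep-level caching rule concrete, the following is a minimal sketch of the cache-or-recompute decision described above, assuming a PyTorch-style DiT block. The function name \texttt{cached\_block\_forward}, the threshold \texttt{tau}, and the relative-norm change statistic are illustrative assumptions, not the actual FastCache-xDiT API; the released code, with its hypothesis-testing rule and learnable linear approximation, may differ.

\begin{verbatim}
import torch

def cached_block_forward(block, h_t, cache, tau=0.05):
    # Reuse the previous timestep's output when the block input has
    # changed little across timesteps; otherwise recompute and refresh
    # the cache. (Illustrative sketch, not the repo's actual API.)
    if cache.get("h_in") is not None:
        # Relative change of the hidden states across timesteps.
        delta = (h_t - cache["h_in"]).norm() / (cache["h_in"].norm() + 1e-8)
        if delta < tau:
            # Below threshold: skip the block and return the cached
            # output (a learnable linear correction is omitted here).
            return cache["h_out"]
    h_out = block(h_t)  # full transformer-block computation
    cache["h_in"], cache["h_out"] = h_t.detach(), h_out.detach()
    return h_out
\end{verbatim}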