Diffusion models have become a cornerstone of modern generative AI owing to their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} incur prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process: by enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching has evolved from \textit{static reuse} toward \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques, such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time, efficient generative AI, injecting new vitality into both the theory and practice of \textit{Efficient Generative Intelligence}.