Efficient video generation models are increasingly vital for multimedia synthetic content generation. Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization in DiT models, exploits the redundancy of the diffusion process to skip computation at different granularities (e.g., step, CFG, block). Nevertheless, existing caching methods are limited to single-granularity strategies and struggle to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first characterizes the interference and boundaries between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, together with an adaptive hybrid cache decision strategy that dynamically selects the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache significantly accelerates video generation (e.g., 1.94$\times$ speedup on Wan 14B and 1.97$\times$ speedup on HunyuanVideo) while delivering superior generation quality and inference efficiency compared to baseline methods.