Generating temporally-consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. Recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": some videos require fewer denoising steps than others to attain reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g., up to 4.7x on Open-Sora 720p 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
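The core idea above -- caching a transformer block's residual output and reusing it across denoising steps, with a reuse schedule driven by how fast the features are changing -- can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the threshold values, the change metric (mean absolute difference between successive residuals), and the helper names `adaptive_cache_schedule` and `denoise_with_adacache` are all illustrative assumptions.

```python
import numpy as np

def adaptive_cache_schedule(distance, thresholds=(0.05, 0.15)):
    """Map a feature-change metric to a number of steps to reuse the cache.
    Small change -> reuse longer; large change -> recompute sooner.
    Threshold values are illustrative, not from the paper."""
    lo, hi = thresholds
    if distance < lo:
        return 3  # reuse the cached residual for 3 more steps
    if distance < hi:
        return 1
    return 0      # recompute at the next step

def denoise_with_adacache(x, num_steps, block_fn):
    """Toy denoising loop: cache the block's residual and reuse it
    while the content changes slowly between steps."""
    cached_residual = None
    prev_residual = None
    reuse_left = 0
    compute_count = 0
    for t in range(num_steps):
        if reuse_left > 0 and cached_residual is not None:
            residual = cached_residual      # skip the expensive block
            reuse_left -= 1
        else:
            residual = block_fn(x, t)       # full transformer block
            compute_count += 1
            if prev_residual is not None:
                # Distance between successive residuals drives the schedule;
                # a motion-aware variant would fold a motion score in here.
                dist = float(np.mean(np.abs(residual - prev_residual)))
                reuse_left = adaptive_cache_schedule(dist)
            prev_residual = residual
            cached_residual = residual
        x = x + residual
    return x, compute_count
```

Because the schedule is recomputed per generation, an easy sequence (slowly-changing residuals) skips many block evaluations, while a hard one falls back toward full computation -- which is the quality-latency trade-off the abstract describes.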