Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. Recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p, 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
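The core idea of caching computations through the diffusion process can be illustrated with a minimal sketch: recompute an expensive transformer block only when its input has drifted past a threshold since the last recompute, and otherwise reuse the cached output. This is an illustrative toy, not the paper's implementation; `expensive_block`, the L1 distance metric, the update rule, and the threshold value are all hypothetical stand-ins for the actual DiT blocks and AdaCache schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_block(x):
    # Stand-in for a heavy DiT attention/MLP block (hypothetical).
    return np.tanh(x) * 0.9

def adacache_denoise(x, steps=50, thresh=0.05):
    """Toy adaptive-caching loop: recompute the heavy block only when the
    current input has changed enough (mean L1 distance) since the input it
    was last computed on; otherwise reuse the cached output. The metric,
    threshold, and update rule are illustrative, not the paper's."""
    cached_in, cached_out = None, None
    recomputes = 0
    for _ in range(steps):
        if cached_in is None or np.abs(x - cached_in).mean() > thresh:
            cached_in, cached_out = x.copy(), expensive_block(x)
            recomputes += 1
        # Toy denoising update that consumes the (possibly cached) output.
        x = x + 0.1 * (cached_out - x)
    return x, recomputes
```

Because slowly-changing generations trip the threshold less often, they get fewer recomputes, mirroring the abstract's point that a per-video schedule can spend less compute on "easier" videos.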