Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly improved the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV cache policies that ignore differences in token importance during long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, uninformative cache entries, degrading both video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous, and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head, distilled from a bidirectional teacher, to estimate per-token salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while enabling accelerated inference, thereby supporting more efficient long-horizon video generation. Our code will be released upon paper acceptance.
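The salience-scored cache retention described above can be illustrated with a minimal sketch. Since the paper's code is not yet released, the linear scorer, function names, and shapes below are illustrative assumptions, not the actual PaFu-KV implementation: a lightweight head assigns each cached token a salience score, and only the top-scoring tokens are kept within a fixed cache budget, with chronological order preserved.

```python
import numpy as np

def salience_head(keys: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Hypothetical lightweight scorer: a single linear projection over
    per-token key states, standing in for the distilled Salience
    Estimation Head. keys: (num_tokens, head_dim) -> (num_tokens,)."""
    return keys @ w + b

def prune_kv_cache(keys: np.ndarray, values: np.ndarray,
                   salience: np.ndarray, budget: int):
    """Keep the `budget` most salient tokens; evict the rest.
    Indices are re-sorted so retained tokens stay in temporal order."""
    if len(keys) <= budget:
        return keys, values
    topk = np.argsort(salience)[-budget:]  # indices of highest scores
    keep = np.sort(topk)                   # restore chronological order
    return keys[keep], values[keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_tokens, head_dim, budget = 16, 8, 4
    keys = rng.standard_normal((num_tokens, head_dim))
    values = rng.standard_normal((num_tokens, head_dim))
    w = rng.standard_normal(head_dim)
    scores = salience_head(keys, w)
    k_kept, v_kept = prune_kv_cache(keys, values, scores, budget)
    print(k_kept.shape)  # (4, 8)
```

In a full autoregressive generator, the pruning step would run periodically per attention head as frames are produced, so the cache footprint stays bounded regardless of video length.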