Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, increasing both latency and GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames; slowly evolving (largely semantic) queries and keys that make many attention computations redundant; and cross-attention over long prompts, where only a small subset of tokens matters per frame. Building on these observations, we propose FAST-AR (FAST-AutoRegressive diffusion), a unified, training-free attention framework consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention compute and memory usage and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of 5x to 10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and consume increasing amounts of memory.
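
To make two of the ideas above concrete, the following is a minimal sketch, not the FAST-AR implementation. The abstract does not specify how TempCache establishes temporal correspondence or which ANN structure AnnSA uses, so everything here is an assumption for illustration: the function names, the cosine-similarity threshold, the "overwrite with the newest entry" merge rule, and the use of exact brute-force top-k search as a stand-in for an ANN lookup. It uses plain Python lists and expects a non-empty key list for attention.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def append_with_dedup(cache_k, cache_v, new_k, new_v, threshold=0.98):
    """Add one frame's keys/values to the cache, merging near-duplicates.

    Hypothetical merge rule: if a new key's best cosine match among the
    entries cached before this frame reaches `threshold`, it overwrites
    that entry; otherwise it is appended and the cache grows by one.
    """
    n_prev = len(cache_k)  # only match against entries from earlier frames
    for k, v in zip(new_k, new_v):
        best_i, best_s = -1, -1.0
        for i in range(n_prev):
            s = cosine(k, cache_k[i])
            if s > best_s:
                best_i, best_s = i, s
        if best_i >= 0 and best_s >= threshold:
            cache_k[best_i], cache_v[best_i] = k, v  # keep the newest copy
        else:
            cache_k.append(k)
            cache_v.append(v)


def topk_sparse_attention(q, keys, values, k=2):
    """Softmax attention from one query over only its k highest-scoring keys.

    Exact top-k selection is used here as a stand-in for an ANN lookup.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(x * y for x, y in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - m) for i in top]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) / z
            for d in range(dim)]
```

In this sketch, a second frame whose keys nearly repeat the first leaves the cache size almost unchanged, which is the bounded-growth behavior the abstract attributes to TempCache. Setting `k` to the number of cached keys recovers ordinary dense softmax attention, so `k` trades accuracy against the amount of attention computed per query.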
