Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major inference-time bottleneck: as generation progresses, the KV cache grows, increasing latency and GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames; slowly evolving (largely semantic) queries and keys that make many attention computations redundant; and cross-attention over long prompts, where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens with fast approximate nearest-neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also via a lightweight ANN search. Together, these modules reduce both attention compute and memory, and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5--10x end-to-end speedups with near-identical visual quality and, crucially, stable throughput and nearly constant peak GPU memory over long rollouts, where prior methods progressively slow down and suffer growing memory usage.
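To make the token-selection idea behind AnnCA concrete, the following is a minimal sketch of cross-attention restricted to frame-relevant prompt tokens. It uses exact cosine-similarity top-k against a pooled frame query as a stand-in for the paper's fast ANN matching (an ANN index would approximate this lookup at lower cost); all function names, the mean-pooling choice, and the top-k formulation are illustrative assumptions, not the authors' exact method.

```python
import numpy as np


def select_relevant_tokens(frame_queries, prompt_keys, k):
    """Pick the k prompt tokens most relevant to the current frame.

    Illustrative stand-in for ANN-based matching: we pool the frame's
    queries into one descriptor and rank prompt tokens by exact cosine
    similarity, which an ANN index would approximate more cheaply.
    """
    q = frame_queries.mean(axis=0)                      # pooled frame descriptor
    q = q / np.linalg.norm(q)
    K = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    sims = K @ q                                        # cosine similarity per token
    return np.argsort(-sims)[:k]                        # indices of top-k tokens


def sparse_cross_attention(frame_queries, prompt_keys, prompt_values, k):
    """Cross-attention computed only over the selected prompt tokens."""
    idx = select_relevant_tokens(frame_queries, prompt_keys, k)
    Ks, Vs = prompt_keys[idx], prompt_values[idx]
    scores = frame_queries @ Ks.T / np.sqrt(frame_queries.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ Vs, idx


rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 64))     # 16 frame queries, dim 64
Kp = rng.normal(size=(300, 64))   # 300 prompt-token keys
Vp = rng.normal(size=(300, 64))   # matching prompt-token values
out, idx = sparse_cross_attention(Q, Kp, Vp, k=32)
```

Because attention cost is linear in the number of keys attended to, shrinking the prompt from 300 tokens to 32 here cuts the cross-attention score and value computations by roughly an order of magnitude, at the price of one similarity lookup per frame.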