Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
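The abstract describes GRAB and TAME only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes a per-frame relevance score is already available (for example, aggregated attention mass). It also assumes that the gated score is a convex combination of relevance and normalized temporal distance, and that the "light" edit is a partial mean and variance alignment toward a trusted reference. The names `select_memory_frames`, `align_statistics`, `gate`, and `strength` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def select_memory_frames(relevance: Sequence[float], budget: int,
                         gate: float = 0.5) -> List[int]:
    """GRAB-like toy selection (assumed form): greedily pick `budget` frame
    indices, trading relevance against temporal spread."""
    n = len(relevance)
    span = max(n - 1, 1)  # normaliser for temporal distance
    chosen: List[int] = []
    while len(chosen) < min(budget, n):
        best_idx, best_score = -1, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            # Diversity: normalised distance to the nearest already-chosen
            # frame; 1.0 when nothing has been chosen yet.
            diversity = min((abs(i - j) for j in chosen), default=span) / span
            # Assumed gating rule: convex mix of relevance and diversity.
            score = gate * relevance[i] + (1.0 - gate) * diversity
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return sorted(chosen)


def align_statistics(tokens: Sequence[float], trusted: Sequence[float],
                     strength: float = 0.3, eps: float = 1e-6) -> List[float]:
    """TAME-like toy edit (assumed form): shift the mean and standard
    deviation of `tokens` part of the way toward those of `trusted`."""
    mu_t, sd_t = fmean(tokens), pstdev(tokens)
    mu_r, sd_r = fmean(trusted), pstdev(trusted)
    edited = []
    for x in tokens:
        aligned = (x - mu_t) / (sd_t + eps) * sd_r + mu_r
        # strength=0 leaves the token untouched; strength=1 aligns fully.
        edited.append((1.0 - strength) * x + strength * aligned)
    return edited


if __name__ == "__main__":
    scores = [0.9, 0.85, 0.8, 0.1, 0.1, 0.1, 0.1, 0.5]
    print(select_memory_frames(scores, budget=2, gate=0.5))
    print(align_statistics([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], strength=0.5))
```

In this sketch, lowering `gate` favors temporally spread-out frames over the most relevant ones, and a small `strength` keeps the edit light. A real system would operate on per-layer key and value tensors and not on scalar lists.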