Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This makes the generation trajectory irreversible: once the active window accumulates appearance errors, subsequent blocks can condition only on this degraded trajectory and drift further. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce the error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, LongLive-RAG is the first open-ended AR long video generation method to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
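The retrieval step described in the abstract (a query embedding is used to look up relevant historical latents at each new block) could be sketched roughly as follows. This is a minimal illustration, not the LongLive-RAG implementation: the abstract does not specify the similarity measure, the memory layout, or how many entries are retrieved. The `LatentMemory` class, the use of cosine similarity, the top-k selection, and the `exclude_last` parameter are all assumptions introduced here for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LatentMemory:
    """Hypothetical append-only store of (embedding, latent) pairs, one per block."""

    def __init__(self) -> None:
        self.embeddings: List[List[float]] = []
        self.latents: List[object] = []

    def add(self, embedding: Vector, latent: object) -> None:
        """Record a newly generated block so later blocks can retrieve it."""
        self.embeddings.append(list(embedding))
        self.latents.append(latent)

    def retrieve(
        self, query: Vector, k: int, exclude_last: int = 0
    ) -> List[Tuple[int, float]]:
        """Return up to k (block_index, score) pairs, highest score first.

        The most recent `exclude_last` blocks are skipped, on the assumption
        (ours, not the paper's) that the sliding window already covers them.
        """
        n_candidates = max(len(self.embeddings) - exclude_last, 0)
        scored = [
            (i, cosine_similarity(query, self.embeddings[i]))
            for i in range(n_candidates)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy usage: four stored blocks with 2-D embeddings; strings stand in for latents.
memory = LatentMemory()
memory.add([1.0, 0.0], "latent_block_0")
memory.add([0.0, 1.0], "latent_block_1")
memory.add([0.9, 0.1], "latent_block_2")
memory.add([0.1, 0.9], "latent_block_3")

hits = memory.retrieve([1.0, 0.0], k=2, exclude_last=1)
retrieved_latents = [memory.latents[i] for i, _ in hits]
```

In this toy setting the query is closest to blocks 0 and 2, so those latents would be handed to the generator alongside the recent window. A real system would presumably use learned embeddings and a batched or approximate nearest-neighbour search rather than a linear scan.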