Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and loss of long-term coherence. While attention-sink frames have been introduced to mitigate this decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, producing abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanism prevalent in current generative models. To address this, we propose a lightweight, training-free approach: a multi-head RoPE jitter that breaks inter-head attention homogenization and thereby suppresses long-horizon collapse. Extensive experiments show that our method alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work provides the first demonstration of real-time, streaming, infinite-length video generation with negligible quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours long, among the longest publicly demonstrated results in streaming video generation.
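To make the core idea concrete, the following is a minimal sketch of what a per-head RoPE jitter could look like. The function names, the choice of a uniform positional offset per head, and the `jitter_scale` parameter are all illustrative assumptions, not the paper's actual implementation: the sketch only demonstrates the general mechanism of giving each attention head a slightly perturbed rotary phase so that heads no longer attend identically.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE rotation angles: theta_i = base^(-2i/dim)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # shape (seq, dim/2)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs (x_even, x_odd) by the given angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def multi_head_rope_jitter(q, positions, jitter_scale=0.5, seed=0):
    """Apply RoPE with a small, fixed positional offset per head.

    q: array of shape (heads, seq, head_dim).
    Each head receives its own offset drawn once from
    U(-jitter_scale, +jitter_scale), breaking the inter-head
    homogenization that a shared rotary phase would induce.
    (Hypothetical sketch; not the authors' released code.)
    """
    rng = np.random.default_rng(seed)
    heads, _, dim = q.shape
    offsets = rng.uniform(-jitter_scale, jitter_scale, size=heads)
    out = np.empty_like(q)
    for h in range(heads):
        angles = rope_angles(positions + offsets[h], dim)
        out[h] = apply_rope(q[h], angles)
    return out
```

Setting `jitter_scale=0` recovers standard RoPE applied uniformly across heads, which makes the sketch easy to ablate: any difference in long-horizon behavior can then be attributed to the per-head phase perturbation alone.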