Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons seen during training and the open-ended horizons encountered at test time. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance; the resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motion. Extensive experiments demonstrate that Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to state-of-the-art baselines. Project page: https://rolling-sink.github.io/
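As context for "AR cache maintenance": training-free long-horizon generation typically hinges on how the KV cache of past frames is pruned. Purely as an illustration (this is a generic sink-plus-rolling-window eviction policy in the spirit of StreamingLLM, not the paper's actual Rolling Sink rule; `evict`, `n_sink`, and `window` are hypothetical names), such a policy can be sketched as:

```python
def evict(cache, n_sink=1, window=4):
    """Keep the first n_sink cached entries (the attention 'sinks')
    plus the most recent `window` entries; drop everything between.
    Generic sketch only -- NOT the paper's exact maintenance rule."""
    if len(cache) <= n_sink + window:
        return list(cache)
    return list(cache[:n_sink]) + list(cache[-window:])

# Toy example: cached KV entries for frames 0..9.
frames = list(range(10))
print(evict(frames))  # [0, 6, 7, 8, 9]
```

The point of any such scheme is that the cache size stays bounded no matter how long generation runs, which is what makes open-ended test horizons feasible without retraining.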