Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is then computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses prior approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond the training rollout length.
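The cycle-consistency objective described above can be sketched in a heavily simplified form. The toy sketch below replaces the video diffusion model with small hypothetical state-transition functions (`forward_step`, `reverse_step`, and the weight matrices `W`, `W_rev` are all stand-ins, not LIVE's actual architecture) and uses a mean-squared reconstruction error in place of the diffusion loss, only to illustrate the forward-rollout-then-reverse-reconstruction structure of the objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, a, W):
    # Stand-in for one autoregressive prediction step of the
    # forward world model, conditioned on action a.
    return np.tanh(W @ x + a)

def reverse_step(x, a, W_rev):
    # Stand-in for one step of the reverse generation process.
    return np.tanh(W_rev @ x - a)

def cycle_consistency_loss(x0, actions, W, W_rev):
    # Forward rollout starting from the ground-truth state x0.
    x = x0
    for a in actions:
        x = forward_step(x, a, W)
    # Reverse rollout that attempts to reconstruct the initial state.
    for a in reversed(actions):
        x = reverse_step(x, a, W_rev)
    # Loss on the reconstructed state; penalizing the full cycle
    # constrains error accumulated over the whole horizon.
    # (MSE here is a simplification of the paper's diffusion loss.)
    return float(np.mean((x - x0) ** 2))

d, T = 8, 4  # state dimension and rollout horizon (arbitrary)
x0 = rng.normal(size=d)
actions = [rng.normal(size=d) for _ in range(T)]
W = rng.normal(scale=0.1, size=(d, d))
W_rev = rng.normal(scale=0.1, size=(d, d))

loss = cycle_consistency_loss(x0, actions, W, W_rev)
print(f"cycle-consistency loss: {loss:.4f}")
```

Minimizing such a loss over both rollout directions is what provides the explicit bound on long-horizon error propagation without any teacher model.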