Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches converge slowly during training and reach limited accuracy, particularly at high frame rates, because training supervision is confined to the current chunk and carries no explicit signal about future dynamics. They are also slow at inference because of iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules, which simultaneously denoise video chunks at multiple future temporal horizons (the next$^1$, next$^2$, and next$^3$ chunks). These MCP modules form a causal chain across prediction depths: intermediate features fused from multiple layers of the main model are used to predict future dynamics, so that near-future predictions inform farther-future ones and dense multi-scale temporal supervision flows back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1%/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and achieves over 50% FVD reduction on general video pretraining.
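To make the shape of the MCP objective concrete, the following is a minimal toy sketch, not the method's actual implementation. It assumes that fused main-model features seed a chain of per-depth modules, that each module passes its hidden state on to the next depth, and that the total loss is the main denoising loss plus a weighted sum of per-depth losses. The averaging fusion rule, the linear `mcp_module`, the `mix` coefficient, the `depth_weights`, and the use of mean squared error are all illustrative assumptions; in practice the modules would be learned networks trained with a diffusion-style denoising loss.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def fuse_layers(layer_feats):
    """Fuse features from several main-model layers by averaging.

    Averaging is a hypothetical fusion rule chosen only for illustration.
    """
    n = len(layer_feats)
    return [sum(vals) / n for vals in zip(*layer_feats)]


def mcp_module(hidden, noisy_chunk, mix):
    """Toy stand-in for one lightweight MCP module (assumed, not learned).

    Blends the hidden state from the previous depth with the noisy future
    chunk, and returns (new_hidden, denoised_prediction).
    """
    new_hidden = [mix * h + (1.0 - mix) * x for h, x in zip(hidden, noisy_chunk)]
    return new_hidden, list(new_hidden)


def mcp_training_loss(layer_feats, main_pred, clean_chunks, noisy_chunks,
                      depth_weights, mix=0.5):
    """Main loss on the current chunk plus weighted losses on next^k chunks.

    clean_chunks[0] is the current chunk; clean_chunks[k] is the next^k chunk.
    """
    losses = [mse(main_pred, clean_chunks[0])]
    hidden = fuse_layers(layer_feats)  # features fused from the main model
    for k in range(1, len(clean_chunks)):
        # Causal chain: depth k consumes the hidden state produced at depth k-1.
        hidden, pred = mcp_module(hidden, noisy_chunks[k], mix)
        losses.append(mse(pred, clean_chunks[k]))
    total = losses[0] + sum(w * l for w, l in zip(depth_weights, losses[1:]))
    return total, losses


# Toy data: one current chunk plus three future chunks, each a 4-dim latent.
rng = random.Random(0)
dim = 4
clean = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
noisy = [[x + 0.1 * rng.gauss(0, 1) for x in chunk] for chunk in clean]
layer_feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
main_pred = noisy[0]  # placeholder for the main model's denoised output

total, losses = mcp_training_loss(
    layer_feats, main_pred, clean, noisy, depth_weights=[1.0, 0.5, 0.25]
)
print(len(losses), total)
```

In this sketch every per-depth loss depends on `hidden`, which originates from the main model's fused features. That dependence is the path along which future-chunk supervision would reach the main model during backpropagation in a real, differentiable implementation.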
