BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Comment from arXiv: After the paper was posted, we discovered that several visualization results were produced with incorrect runtime configuration settings. This error affects the reliability of the presented visual comparisons. Additionally, the design needs further optimization. We therefore request withdrawal of this version and will submit a corrected and improved version later.

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.
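The contrast the abstract draws between DMD's mode-seeking behavior and a mass-covering forward-KL objective can be illustrated with a toy discrete example. The sketch below is not BiWM's loss; it assumes DMD behaves like a reverse-KL objective, KL(q || p), and compares it with the forward KL, KL(p || q), on made-up three-bin distributions. All names and numbers (`p_data`, `q_single_mode`, `q_covering`) are hypothetical and chosen only for illustration.

```python
import math


def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


# Toy bimodal "data" distribution over three bins (illustrative numbers only).
p_data = [0.495, 0.01, 0.495]

# Two hypothetical generator distributions.
q_single_mode = [0.98, 0.01, 0.01]  # collapses onto one of the two modes
q_covering = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over every bin

candidates = {"single_mode": q_single_mode, "covering": q_covering}

# Reverse KL, KL(q || p): heavily penalizes q placing mass where p is small.
reverse_kl = {name: kl(q, p_data) for name, q in candidates.items()}

# Forward KL, KL(p || q): heavily penalizes q missing mass where p is large.
forward_kl = {name: kl(p_data, q) for name, q in candidates.items()}

# Which candidate each objective prefers (lower divergence is better).
best_reverse = min(reverse_kl, key=reverse_kl.get)
best_forward = min(forward_kl, key=forward_kl.get)

print("reverse KL prefers:", best_reverse)
print("forward KL prefers:", best_forward)
```

In this toy setting the reverse KL favors the candidate that drops a mode, while the forward KL favors the one that covers both. This is the usual intuition for why a mass-covering term might help preserve diversity, here scene dynamics, when added to a mode-seeking distillation loss.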
