This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models can imagine the near future by understanding the causal relationship between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space that integrates vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism that continually incorporates environmental feedback in the form of ground-truth observations; and (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution for efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalization to novel configurations. The code and model are publicly available to benefit the community.
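The asynchronous inference pipeline in design (3) can be illustrated with a minimal sketch: while the robot executes the current action chunk, the policy already predicts the next one in a background thread. This is an assumption-laden toy (the abstract gives no implementation details); `predict_actions` and `execute` are hypothetical stand-ins, and the chunk size of 4 is arbitrary.

```python
import threading
import time

def predict_actions(obs):
    # Hypothetical stand-in for the policy's action-chunk prediction
    # (in the real system this would be a forward pass of the model).
    return [obs + i for i in range(4)]  # dummy 4-step action chunk

def execute(action):
    # Stand-in for motor execution latency on the robot.
    time.sleep(0.001)

def async_control_loop(num_chunks=3):
    """Overlap prediction of chunk k+1 with execution of chunk k."""
    executed = []
    chunk = predict_actions(0)  # bootstrap: predict the first chunk synchronously
    for k in range(1, num_chunks + 1):
        next_chunk = {}

        def worker(obs):
            next_chunk["c"] = predict_actions(obs)

        t = threading.Thread(target=worker, args=(k,))
        t.start()            # predict the next chunk in parallel...
        for a in chunk:      # ...while executing the current one
            execute(a)
            executed.append(a)
        t.join()
        chunk = next_chunk["c"]
    return executed

actions = async_control_loop()
print(len(actions))  # 12 actions executed (3 chunks of 4)
```

The point of the overlap is that prediction latency is hidden behind execution time, so the robot never idles waiting for the model, which is the efficiency claim the abstract makes for the asynchronous pipeline.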