Vision-Language-Action (VLA) models show remarkable potential for generalizable robotic manipulation. Robust instruction grounding, a critical component of effective control, is key to executing complex multi-step behaviors with VLA models. However, current paradigms rely predominantly on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Bridging this gap with scalable post-training for VLA models is therefore an urgent need. To this end, we propose \method, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. It then uses a pretrained predictive world model to score candidate action chunks against subtask goals in latent space, mitigating error accumulation and substantially improving long-horizon robustness. This design also enables highly efficient Group Relative Policy Optimization (GRPO) without the prohibitive cost of online rollouts on physical robots. Extensive simulation experiments confirm that \method remains robust under perturbations: compared with representative baselines, it achieves an average success rate of 97.0\% on the LIBERO benchmark and 48.0\% on the LIBERO-PRO benchmark. Finally, real-world experiments on the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long-horizon ones. All datasets, checkpoints, and code will be publicly released upon acceptance of this work to support future research.
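To make the two mechanisms named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' released implementation: scoring candidate action chunks against a subtask goal in a world model's latent space, and converting those scores into group-relative advantages in the style of GRPO so that no value critic or online robot rollout is required. All names here (the world model object, its \texttt{encode} and \texttt{rollout} methods, and the tensor shapes) are hypothetical placeholders.
\begin{verbatim}
# Illustrative sketch only; WorldModel, encode, and rollout
# are assumed interfaces, not the paper's actual API.
import torch

def score_chunks(world_model, obs, action_chunks, subtask_goal_emb):
    """Score each candidate action chunk by how close the
    world model's predicted future latent is to the subtask
    goal embedding (higher score = closer to the goal)."""
    z = world_model.encode(obs)              # current latent state
    scores = []
    for chunk in action_chunks:              # chunk: (T, action_dim)
        z_pred = world_model.rollout(z, chunk)  # predicted latent
        # negative latent distance to the goal as the reward
        scores.append(-torch.norm(z_pred - subtask_goal_emb))
    return torch.stack(scores)

def group_relative_advantages(scores, eps=1e-6):
    """GRPO-style advantage: normalize rewards within the group
    of candidate chunks sampled for the same state, replacing a
    learned critic and avoiding physical-robot rollouts."""
    return (scores - scores.mean()) / (scores.std() + eps)
\end{verbatim}
Under these assumptions, the advantages would weight the policy-gradient update for each sampled chunk entirely offline, which is what makes the post-training pipeline scalable.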