Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner and asynchronously receives updated policies. This design supports timely on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks, including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pre-trained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental in enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
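To make the closed-loop architecture concrete, the following is a minimal sketch of the asynchronous fleet-to-learner loop described above. It is an illustrative assumption, not the paper's implementation: the names (`robot_worker`, `cloud_learner`, `experience_queue`), the thread-and-queue realization, and the version counter standing in for policy weights are all hypothetical, and the real system would run across physical robots and a cloud service rather than threads in one process.

```python
# Hypothetical sketch of a SOP-style closed loop: a fleet of workers streams
# on-policy transitions (with human intervention flags) to a central learner,
# and each worker asynchronously pulls the latest policy between episodes.
import queue
import random
import threading
import time

experience_queue: "queue.Queue[dict]" = queue.Queue()

# A version counter stands in for real policy parameters.
policy_lock = threading.Lock()
policy_version = 0


def robot_worker(robot_id: int, episodes: int) -> None:
    """One robot in the fleet: act on-policy, stream experience, sync policy."""
    for _ in range(episodes):
        with policy_lock:
            local_version = policy_version  # asynchronous pull of latest policy
        for _ in range(5):  # one short episode of on-policy execution
            experience_queue.put({
                "robot": robot_id,
                "policy_version": local_version,
                "obs": random.random(),
                "action": random.random(),
                # Human intervention signals stream alongside actions, as in
                # HG-DAgger-style interactive imitation learning.
                "intervened": random.random() < 0.1,
            })
        time.sleep(0.01)


def cloud_learner(total_batches: int, batch_size: int = 8) -> None:
    """Centralized learner: consume streamed experience, publish new policies."""
    global policy_version
    for _ in range(total_batches):
        batch = [experience_queue.get() for _ in range(batch_size)]
        # ... a gradient update on `batch` would happen here ...
        with policy_lock:
            policy_version += 1  # publish the updated policy to the fleet


workers = [threading.Thread(target=robot_worker, args=(i, 4)) for i in range(3)]
learner = threading.Thread(target=cloud_learner, args=(7,))
for t in workers + [learner]:
    t.start()
for t in workers + [learner]:
    t.join()
print(f"final policy version: {policy_version}")
```

Because workers pull policies asynchronously rather than blocking on the learner, adding robots scales experience collection in parallel, which is consistent with the near-linear scaling with fleet size reported above.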