Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.