We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides the learning signal that guides the VLA policy as it learns from experience. In practice, the value function is estimated from trajectory fragments collected from heterogeneous data sources, including historical policies and intermittent human interventions. Estimating the value of the current policy's behavior from this data mixture is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design enables effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks: smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced reliably for real-world VLA post-training. Videos and additional materials are available at our project website.
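The core idea of chunking-based temporal-difference bootstrapping can be illustrated with a minimal sketch. The abstract does not give ALOE's exact formulation, so the following is an assumption-laden illustration of a generic n-step TD target computed over one action chunk: rewards inside the chunk are discounted and summed, and the critic's value estimate at the chunk boundary supplies the bootstrap term. The function name and signature are hypothetical, not from the paper.

```python
import numpy as np

def chunk_td_target(rewards, boundary_value, gamma=0.99):
    """Hypothetical n-step TD target for a single action chunk.

    rewards:        per-step rewards observed within the chunk (length k)
    boundary_value: critic's value estimate at the state after the chunk
    gamma:          per-step discount factor

    Returns the bootstrapped target: sum_i gamma^i * r_i + gamma^k * V(s_{t+k}).
    """
    k = len(rewards)
    discounts = gamma ** np.arange(k)          # [1, gamma, gamma^2, ...]
    chunk_return = float(np.dot(discounts, rewards))
    return chunk_return + gamma**k * boundary_value

# Sparse-reward example: reward appears only at the end of the chunk,
# and the bootstrap term propagates value from beyond the chunk boundary.
target = chunk_td_target(rewards=[0.0, 0.0, 0.0, 1.0],
                         boundary_value=0.5, gamma=0.9)
```

Because the target is formed per chunk rather than per episode outcome, credit can flow back to the specific action chunk that preceded a sparse reward, which is the property the abstract attributes to this design.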