We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides the learning signal that guides VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from heterogeneous data sources, including historical policies and intermittent human interventions. Estimating the value of the current policy's behavior from such mixed data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids directly evaluating the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks: smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.
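The chunking-based temporal-difference bootstrapping described above can be sketched as follows. This is a minimal illustration of the general idea, not ALOE's actual implementation: we assume the critic evaluates states at action-chunk boundaries, so the TD target accumulates any (typically sparse) rewards inside a chunk of H low-level steps and then bootstraps from the value at the next chunk boundary, rather than regressing only on the final task outcome. The function name `chunk_td_targets` and the toy episode are hypothetical.

```python
def chunk_td_targets(rewards, values, chunk_len, gamma=0.99):
    """Compute TD targets at action-chunk boundaries.

    rewards:   per-step rewards (sparse: mostly 0, with a terminal bonus)
    values:    critic estimates V(s_t) for t = 0..T, len == len(rewards) + 1
    chunk_len: number of low-level steps per action chunk H
    """
    targets = []
    for t in range(0, len(rewards), chunk_len):
        # Discounted return accumulated inside the current chunk...
        g, discount = 0.0, 1.0
        for r in rewards[t:t + chunk_len]:
            g += discount * r
            discount *= gamma
        # ...then bootstrap from the value at the next chunk boundary,
        # which gives every chunk its own learning signal even when
        # reward only arrives at the end of the episode.
        boundary = t + min(chunk_len, len(rewards) - t)
        targets.append(g + discount * values[boundary])
    return targets

# Toy example: a 6-step episode split into two chunks of H = 3,
# with a single sparse success reward at the final step.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9, 0.0]  # V(s_0)..V(s_6); terminal V = 0
print(chunk_td_targets(rewards, values, chunk_len=3))
```

Because each chunk's target bootstraps from a nearby boundary value rather than waiting for the episode outcome, credit can propagate to the chunk that actually made progress; the first chunk's target here comes entirely from the bootstrapped value, while the second's comes from the discounted terminal reward.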