Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

When pretrained vision-language-action (VLA) policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which causes two problems. First, a single scalar conflates the two objectives of viability and efficiency; once basic success is achievable, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively propagating the episode outcome across these boundaries assigns credit incorrectly. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains a separate critic head for each objective on a different data subset. A state-adaptive gate $g_t$ merges the two heads' one-step advantages, prioritizing viability when success is uncertain and shifting toward efficiency only when viability is high; the merged advantage is then converted into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success rates from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
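To make the weighting mechanism concrete, the following is a minimal sketch of how a gated combination of two advantages could be turned into per-transition actor weights, and how outcome labels could be restricted to policy-executed steps. The abstract does not specify the functional forms, so everything here is an illustrative assumption: the sigmoid gate, its `threshold` and `sharpness` parameters, the convex mixing rule, the clipped exponential weighting, and the per-step labeling rule, as well as all function names.

```python
import math


def gate(viability, threshold=0.8, sharpness=10.0):
    """Hypothetical gate g_t in (0, 1).

    Assumed form: a sigmoid of the viability estimate, close to 0 when
    success is uncertain and close to 1 when viability is high.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def combined_advantage(a_viab, a_eff, viability):
    """Assumed mixing rule: convex combination of the two one-step advantages."""
    g = gate(viability)
    return (1.0 - g) * a_viab + g * a_eff


def actor_weight(advantage, temperature=1.0, max_weight=20.0):
    """Assumed weighting: exponentiated advantage, clipped at max_weight.

    The exponent is clipped before exponentiation to avoid overflow.
    """
    return math.exp(min(advantage / temperature, math.log(max_weight)))


def outcome_labels(is_policy_step, success):
    """Assumed labeling rule: only steps executed by the current policy
    receive the episode outcome; intervention steps get None.
    """
    return [success if p else None for p in is_policy_step]


# Toy episode with three transitions (all numbers are made up).
a_viab = [0.5, 0.2, 0.0]      # viability-head advantages
a_eff = [-0.3, 0.1, 0.4]      # efficiency-head advantages
viability = [0.3, 0.7, 0.95]  # estimated viability per state

weights = [
    actor_weight(combined_advantage(av, ae, v))
    for av, ae, v in zip(a_viab, a_eff, viability)
]
labels = outcome_labels([True, False, True], success=1)
```

With these assumed forms, the first transition is weighted almost entirely by its viability advantage because the gate is near zero there, while the last transition is dominated by its efficiency advantage.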
