Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding vision-language understanding in real-world robotic manipulation. Dexterous manipulation nevertheless remains challenging for VLA policies because of high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. Yet high-dimensional dexterous exploration in the real world often leads to temporal inconsistency, sample inefficiency, and hardware risk. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. In the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL), chunk-wise residual adaptation mechanism that mitigates real-world execution errors and further corrects the offline-learned intents in the physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA corrects execution discrepancies and adapts to real-world physical variations while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
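To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not BORA's implementation. It assumes that the chunk-wise residual is added to the frozen base policy's action chunk with a small scale and then clipped to actuator limits, and that the critic is a toy linear function over the concatenated cognition tokens and flattened action chunk. The abstract specifies neither choice; the names `apply_residual`, `critic_value`, `scale`, `low`, and `high` are hypothetical.

```python
from typing import List, Sequence

# An action chunk: H timesteps, each a list of D joint targets.
Chunk = List[List[float]]


def apply_residual(base_chunk: Chunk, residual: Chunk, scale: float = 0.1,
                   low: float = -1.0, high: float = 1.0) -> Chunk:
    """Add a scaled residual to a frozen base policy's action chunk.

    Assumption: additive combination followed by clipping. The abstract
    does not state how the residual is merged with the base action.
    """
    if len(base_chunk) != len(residual):
        raise ValueError("base and residual chunks must share a horizon")
    adapted: Chunk = []
    for base_step, res_step in zip(base_chunk, residual):
        if len(base_step) != len(res_step):
            raise ValueError("action dimensions must match at every step")
        # Keep the base action as the prior; the residual only nudges it.
        adapted.append([min(high, max(low, b + scale * r))
                        for b, r in zip(base_step, res_step)])
    return adapted


def critic_value(tokens: Sequence[float], chunk: Chunk,
                 weights: Sequence[float], bias: float = 0.0) -> float:
    """Toy linear critic Q(tokens, chunk) over concatenated inputs.

    Assumption: a linear stand-in for a learned network, used only to show
    that the value depends on the action chunk as well as the tokens.
    """
    features = list(tokens) + [a for step in chunk for a in step]
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature size")
    return bias + sum(w * f for w, f in zip(weights, features))


if __name__ == "__main__":
    base = [[0.5, -0.2], [0.95, 0.0]]      # chunk from the frozen base policy
    residual = [[1.0, -1.0], [1.0, 0.0]]   # output of the residual module
    adapted = apply_residual(base, residual, scale=0.1)
    weights = [0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
    print(adapted, critic_value([1.0, 2.0], adapted, weights))
```

Because the critic receives the action chunk as an input, two different chunks proposed for the same observation receive different values, which is what allows a critic of this kind to rank candidate hand motions. The small `scale` and the clipping reflect the abstract's stated goal of preserving the pretrained policy as a stable prior, although the specific bounding scheme here is an illustrative choice.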