Pretrained on large-scale, diverse datasets, Vision-Language-Action (VLA) models demonstrate strong generalization and adaptability as general-purpose robotic policies. However, Supervised Fine-Tuning (SFT), the primary mechanism for adapting VLAs to downstream domains, requires substantial amounts of task-specific data and is prone to catastrophic forgetting. To address these limitations, we propose LifeLong-RFT, a simple yet effective Reinforcement Fine-Tuning (RFT) strategy for VLA models that requires neither online environmental feedback nor a pre-trained reward model. By integrating chunk-level on-policy reinforcement learning with the proposed Multi-Dimensional Process Reward (MDPR) mechanism, LifeLong-RFT quantifies the heterogeneous contributions of intermediate action chunks across three dimensions to facilitate policy optimization. Specifically, (1) the Quantized Action Consistency Reward (QACR) ensures accurate action prediction within the discrete action space; (2) the Continuous Trajectory Alignment Reward (CTAR) aligns decoded continuous action chunks with reference trajectories to ensure precise control; and (3) the Format Compliance Reward (FCR) guarantees the structural validity of outputs. Comprehensive experiments across SimplerEnv, LIBERO, and real-world tasks demonstrate that LifeLong-RFT achieves strong multi-task learning performance. Furthermore, for continual learning on the LIBERO benchmark, our method achieves a 22% gain in average success rate over SFT while effectively adapting to new tasks using only 20% of the training data. Overall, our method provides a promising post-training paradigm for VLAs.
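
To make the MDPR formulation concrete, the sketch below illustrates how the three per-chunk reward dimensions (QACR, CTAR, FCR) could be computed and combined into a scalar process reward. This is a minimal illustration only: the exact reward formulas, normalization, output format, and weighting coefficients (`w_qacr`, `w_ctar`, `w_fcr`) are not specified in the abstract and are assumptions made for clarity.

```python
# Minimal sketch of a Multi-Dimensional Process Reward (MDPR) for one action chunk.
# All formulas, the output-format check, and the weights are illustrative assumptions.
import numpy as np


def qacr(pred_tokens: np.ndarray, ref_tokens: np.ndarray) -> float:
    """Quantized Action Consistency Reward: fraction of discrete action tokens
    in the chunk that exactly match the reference (assumed exact-match form)."""
    return float(np.mean(pred_tokens == ref_tokens))


def ctar(pred_traj: np.ndarray, ref_traj: np.ndarray, scale: float = 1.0) -> float:
    """Continuous Trajectory Alignment Reward: maps the mean L2 error between the
    decoded continuous chunk and the reference trajectory into (0, 1]
    (the actual alignment metric used by the paper is an assumption here)."""
    err = np.linalg.norm(pred_traj - ref_traj, axis=-1).mean()
    return float(np.exp(-scale * err))


def fcr(output_text: str) -> float:
    """Format Compliance Reward: 1.0 if the raw model output parses into the
    expected action-chunk structure, else 0.0 (the tag format is hypothetical)."""
    return 1.0 if output_text.startswith("<action>") and output_text.endswith("</action>") else 0.0


def mdpr(pred_tokens, ref_tokens, pred_traj, ref_traj, output_text,
         w_qacr: float = 1.0, w_ctar: float = 1.0, w_fcr: float = 0.5) -> float:
    """Weighted sum of the three per-chunk reward dimensions (weights assumed)."""
    return (w_qacr * qacr(pred_tokens, ref_tokens)
            + w_ctar * ctar(pred_traj, ref_traj)
            + w_fcr * fcr(output_text))


if __name__ == "__main__":
    # Toy example: an 8-step action chunk with 7-DoF actions.
    rng = np.random.default_rng(0)
    ref_tok = rng.integers(0, 256, size=(8, 7))
    pred_tok = ref_tok.copy()
    pred_tok[0, 0] += 1  # one mismatched discrete token
    ref_xyz = rng.normal(size=(8, 7))
    pred_xyz = ref_xyz + 0.01 * rng.normal(size=(8, 7))
    print(mdpr(pred_tok, ref_tok, pred_xyz, ref_xyz, "<action>...</action>"))
```

Such a per-chunk scalar could then serve as the process reward in a chunk-level on-policy RL update, though the specific optimization objective is beyond what the abstract states.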