Large Vision-Language-Action (VLA) models have shown significant potential for embodied AI. However, they are predominantly trained via supervised fine-tuning (SFT), which limits generalization because the resulting policies are susceptible to compounding errors under distribution shift. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial and error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io