Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection required to scale supervised fine-tuning (SFT), applying RL to large-scale flow-based VLAs (\eg, $π_0$, $π_{0.5}$) remains challenging because flow matching yields intractable action log-likelihoods. We address this challenge with $π_{\texttt{RL}}$, which features two technical approaches: (1) \textbf{Flow-Noise} models the denoising process as a discrete-time MDP with a learnable noise network, enabling exact log-likelihood computation; (2) \textbf{Flow-SDE} integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $π_{\texttt{RL}}$ across various benchmarks, and the experiments demonstrate that RL yields significant performance improvements in both in-distribution and out-of-distribution settings.
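To make the tractability argument concrete, the following is a minimal sketch of why injecting Gaussian noise into each denoising step gives a computable log-likelihood: every transition becomes a Gaussian with known mean and standard deviation, so the log-probability of a sampled denoising chain is a sum of Gaussian log-densities. This is an illustration under stated assumptions, not the $π_{\texttt{RL}}$ implementation. The names `noisy_denoise`, `velocity_fn`, and `sigma_fn` are hypothetical, the time convention (noise at $t=0$, action at $t=1$) is assumed, and the toy velocity field and constant noise scale stand in for the learned flow model and a learnable noise network.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Log density of a diagonal Gaussian, summed over action dimensions."""
    var = std ** 2
    return float(np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))))


def noisy_denoise(velocity_fn, sigma_fn, a0, num_steps, rng):
    """Run a stochastic Euler denoising chain starting from a0.

    Returns the final sample and the log-likelihood of the sampled
    transitions, conditioned on a0 (the density of a0 is not included).
    """
    dt = 1.0 / num_steps
    a = a0
    logp = 0.0
    for k in range(num_steps):
        t = k * dt
        mean = a + velocity_fn(a, t) * dt  # deterministic Euler step
        std = sigma_fn(a, t)               # hypothetical stand-in for a noise network
        nxt = mean + std * rng.standard_normal(a.shape)
        logp += gaussian_logpdf(nxt, mean, std)  # each transition is Gaussian
        a = nxt
    return a, logp


# Toy usage: a velocity field that pulls samples toward a fixed target.
rng = np.random.default_rng(0)
target = np.array([1.0, -1.0])
action, logp = noisy_denoise(
    velocity_fn=lambda a, t: target - a,
    sigma_fn=lambda a, t: 0.1 * np.ones_like(a),
    a0=np.zeros(2),
    num_steps=10,
    rng=rng,
)
```

With a purely deterministic sampler the same chain would have no per-step density to evaluate, which is the obstacle the abstract attributes to flow matching.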