Recent advances in learning decision-making policies can largely be attributed to training expressive policy models, primarily via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training for a new policy class often presents a challenge: most deep RL machinery is co-developed with assumptions about the policy class and backbone, resulting in poor performance when the policy class changes. For instance, SAC utilizes a low-variance reparameterized policy gradient for Gaussian policies, but this gradient is unstable for diffusion policies and intractable for autoregressive categorical policies. To address this issue, we develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes, with varying architectures and sizes. We build on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to "optimized" actions. To obtain these optimized actions, we first sample multiple actions from a base policy, then run global optimization (i.e., re-ranking the action samples using the Q-function) and local optimization (i.e., taking gradient steps on an action sample) to maximize the critic over these candidates. PA-RL enables fine-tuning diffusion and transformer policies, with either autoregressive token or continuous action outputs, at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2x compared to existing offline RL and online fine-tuning methods. We present the first result that successfully fine-tunes OpenVLA, a 7B-parameter generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm, improving its real-world success rate from 40% to 70% in 40 minutes.
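The action-optimization procedure described above (sample from a base policy, globally re-rank candidates with the Q-function, then locally refine them by gradient ascent on the critic) can be illustrated with a minimal sketch. Everything here is a toy stand-in, not the paper's implementation: `q_function` is a hypothetical quadratic critic with a known analytic gradient, and the "base policy" is just a Gaussian sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_function(state, action):
    # Toy critic (hypothetical): highest when action == -state.
    return -np.sum((action + state) ** 2, axis=-1)

def q_grad(state, action):
    # Analytic gradient of the toy critic w.r.t. the action.
    return -2.0 * (action + state)

def optimize_actions(state, base_samples, top_k=4, n_grad_steps=10, lr=0.1):
    # Global optimization: re-rank base-policy samples by critic value
    # and keep the top_k candidates.
    scores = q_function(state, base_samples)
    candidates = base_samples[np.argsort(scores)[-top_k:]].copy()
    # Local optimization: gradient ascent on Q w.r.t. each candidate action.
    for _ in range(n_grad_steps):
        candidates += lr * q_grad(state, candidates)
    # Return the highest-value optimized action; in PA-RL-style training
    # this action would serve as a supervised target for any policy class.
    return candidates[np.argmax(q_function(state, candidates))]

state = np.array([0.5, -0.2])
base_samples = rng.normal(size=(16, 2))  # candidates from a "base policy"
a_star = optimize_actions(state, base_samples)
```

The returned `a_star` attains a higher critic value than any raw sample, which is the property that lets a plain supervised loss on optimized actions stand in for an explicit policy gradient.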