Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we ask how to develop a scalable offline RL approach that relies on neither distillation nor backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. During training, EVOR learns an optimal, regularized Q-function via flow matching. At inference time, EVOR extracts a policy via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
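To make the inference-time step concrete, the sketch below shows one way policy extraction via rejection sampling against a learned Q-function can be implemented. The helper names `sample_actions` and `q_values`, the exponential acceptance rule, and the temperature `beta` are assumptions introduced for illustration, not EVOR's actual interface or acceptance criterion.

```python
import numpy as np

def extract_action(state, sample_actions, q_values, num_candidates=32, beta=1.0, rng=None):
    """Illustrative sketch of inference-time policy extraction via rejection
    sampling against a learned Q-function (hypothetical interface, not EVOR's).

    sample_actions(state, n, rng) -> n candidate actions drawn from the
        expressive base policy (e.g. a flow-matching generative policy).
    q_values(state, actions)      -> array of n scalar Q estimates.
    """
    rng = rng if rng is not None else np.random.default_rng()
    actions = sample_actions(state, num_candidates, rng)
    q = np.asarray(q_values(state, actions), dtype=np.float64)
    # Accept candidate i with probability exp((Q_i - Q_max) / beta), so the
    # resampled policy concentrates on actions the value function prefers
    # while remaining supported on the base policy's own samples.
    accept_prob = np.exp((q - q.max()) / beta)
    for i in range(num_candidates):
        if rng.random() < accept_prob[i]:
            return actions[i]
    # Safety fallback (the best-scoring candidate is always accepted above).
    return actions[int(np.argmax(q))]
```

Under a scheme like this, `num_candidates` is the compute knob for inference-time search, and `beta` trades off value optimization against staying close to the base policy; neither requires retraining the policy or value function.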