Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.
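To make the "expanded" MDP view concrete, the following is a minimal sketch and not the RQL implementation. It assumes Euler integration of a velocity field and a straight-line noise-to-action path, as in standard flow matching; the abstract specifies neither. The names `velocity`, `rollout_expanded`, and `reverse_from_data` are hypothetical, and `velocity` is a toy stand-in for a learned network. The first function logs every refinement step as its own transition; the second builds a virtual refinement path that ends at a dataset action, which is one plausible reading of "reversing" a flow.

```python
import math

def velocity(state, x, t):
    """Hypothetical stand-in for a learned velocity field v(s, x, t)."""
    target = [math.tanh(s) for s in state]  # toy "good action" for this state
    return [tg - xi for tg, xi in zip(target, x)]

def rollout_expanded(state, noise, num_steps=4):
    """Integrate the flow with Euler steps, logging each step as a transition.

    Each record is (env_state, partial_action, step_index, refinement): the
    partial action and step index extend the state, and the Euler increment
    plays the role of the action in the expanded MDP.
    """
    x = list(noise)
    dt = 1.0 / num_steps
    transitions = []
    for k in range(num_steps):
        delta = [dt * v for v in velocity(state, x, k * dt)]
        transitions.append((state, list(x), k, delta))
        x = [xi + di for xi, di in zip(x, delta)]
    return x, transitions

def reverse_from_data(state, data_action, noise, num_steps=4):
    """Build a virtual refinement path that ends at a dataset action.

    Assumption: a straight-line path from noise to the dataset action;
    the paper's actual reversal procedure may differ.
    """
    dt = 1.0 / num_steps
    delta = [dt * (a - z) for a, z in zip(data_action, noise)]
    path = []
    for k in range(num_steps):
        partial = [z + k * d for z, d in zip(noise, delta)]
        path.append((state, partial, k, delta))
    return path

if __name__ == "__main__":
    s, z = [0.5, -1.0], [0.1, 0.2]
    action, transitions = rollout_expanded(s, z)
    virtual = reverse_from_data(s, [0.3, -0.7], z)
    print(len(transitions), len(virtual))
```

In this sketch, one environment step with a `num_steps`-step flow becomes `num_steps` expanded transitions, which lengthens the effective horizon and is presumably why the abstract pairs the framework with a technique against the curse of horizon.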