The goal of reinforcement learning (RL) is to find a policy that maximizes the expected cumulative return. It has been shown that this objective can be expressed as an optimization problem over the state-action visitation distribution under linear constraints. The dual of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. In this work, we first cast several state-of-the-art offline RL and offline imitation learning (IL) algorithms as instances of dual RL with shared structure. This unification allows us to identify the root causes of the shortcomings of prior methods. For offline IL, our analysis shows that prior methods rely on a restrictive coverage assumption that greatly limits their performance in practice. To address this limitation, we propose ReCOIL, a new discriminator-free method that learns to imitate from arbitrary off-policy data and obtains near-expert performance. For offline RL, our analysis frames a recent method, XQL, in the dual framework, and we further propose a new method, f-DVL, which offers alternatives to the Gumbel regression loss and thereby fixes the known training instability of XQL. The performance improvements of both proposed methods, ReCOIL and f-DVL, in IL and RL are validated on an extensive suite of simulated robot locomotion and manipulation tasks. Project code and details can be found at https://hari-sikchi.github.io/dual-rl.
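To make the visitation-distribution view of the RL objective concrete, the following is a minimal sketch on a toy two-state, two-action MDP. It is not the authors' code: the MDP, the policy, and all names (`P`, `R`, `PI`, `visitation`, `policy_return`, `flow_residual`) are assumptions made up for illustration. It checks numerically that the normalized discounted visitation distribution d of a fixed policy satisfies the linear (Bellman flow) constraints, and that the linear objective Σ d(s,a) r(s,a) equals (1 − γ) times the expected discounted return.

```python
# Toy MDP; every number below is invented for illustration only.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
D0 = [1.0, 0.0]  # initial state distribution
# P[s][a][s2]: probability of reaching s2 after taking action a in state s
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[0.0, 1.0], [2.0, 0.5]]    # R[s][a]: reward
PI = [[0.3, 0.7], [0.6, 0.4]]   # PI[s][a]: a fixed stochastic policy


def visitation(pi, iters=2000):
    """Normalized discounted state-action visitation d(s, a) of policy pi."""
    # Fixed-point iteration on the state visitation rho:
    # rho(s) = (1 - g) d0(s) + g * sum_{s', a'} rho(s') pi(a'|s') P(s | s', a')
    rho = list(D0)
    for _ in range(iters):
        rho = [
            (1 - GAMMA) * D0[s]
            + GAMMA * sum(rho[sp] * pi[sp][a] * P[sp][a][s]
                          for sp in STATES for a in ACTIONS)
            for s in STATES
        ]
    return [[rho[s] * pi[s][a] for a in ACTIONS] for s in STATES]


def policy_return(pi, iters=2000):
    """Expected discounted return E[sum_t gamma^t r_t] via iterative policy evaluation."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        v = [
            sum(pi[s][a] * (R[s][a]
                            + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES))
                for a in ACTIONS)
            for s in STATES
        ]
    return sum(D0[s] * v[s] for s in STATES)


def flow_residual(d):
    """Largest violation of the linear Bellman flow constraints by d."""
    return max(
        abs(sum(d[s][a] for a in ACTIONS)
            - (1 - GAMMA) * D0[s]
            - GAMMA * sum(d[sp][ap] * P[sp][ap][s]
                          for sp in STATES for ap in ACTIONS))
        for s in STATES
    )


d = visitation(PI)
linear_objective = sum(d[s][a] * R[s][a] for s in STATES for a in ACTIONS)
expected_return = policy_return(PI)

print("flow constraint violation:", flow_residual(d))
print("sum_d r:", linear_objective, " (1 - gamma) * J:", (1 - GAMMA) * expected_return)
```

Because the objective is linear in d and the flow constraints are linear as well, maximizing over all feasible d is a linear program; the sketch only evaluates one feasible point (the visitation of `PI`) rather than solving that program or its dual.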