The goal of reinforcement learning (RL) is to maximize the expected cumulative return. It has been shown that this objective can be expressed as an optimization problem over the state-action visitation distribution under linear constraints. The dual of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep RL algorithms, spanning online and offline settings and both RL and imitation learning (IL), can be viewed as dual RL approaches within a unified framework. This unification provides common ground for studying and identifying the components that contribute to the success of these methods, and it reveals shortcomings shared across methods, along with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods rely on an unrealistic coverage assumption and minimize a particular f-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new method, a simple modification to the dual RL framework, that enables performant imitation learning from arbitrary off-policy data and obtains near-expert performance without learning a discriminator. Further, by framing XQL, a recent state-of-the-art (SOTA) offline RL method, in the dual RL framework, we propose alternatives to its Gumbel regression loss that achieve improved performance and resolve XQL's training instability. Project code and details can be found at https://hari-sikchi.github.io/dual-rl.
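To make the visitation-distribution view concrete, the following is a minimal sketch and is not taken from the project code. It assumes a hypothetical 2-state, 2-action MDP with made-up transition probabilities, rewards, and a fixed stochastic policy. It does not solve the optimization problem or its dual. It only checks numerically that the normalized discounted visitation distribution d(s, a) of a policy satisfies the linear Bellman flow constraints, and that the expected return equals the expectation of the reward under d divided by (1 - γ), which is why the RL objective is linear in d.

```python
# Hypothetical 2-state, 2-action MDP; every number below is illustrative.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

# P[s][a][s2]: probability of moving from s to s2 under action a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
# R[s][a]: reward for taking action a in state s.
R = [[0.0, 1.0],
     [2.0, 0.5]]
D0 = [1.0, 0.0]      # initial state distribution
PI = [[0.3, 0.7],    # PI[s][a]: a fixed stochastic policy (assumed)
      [0.6, 0.4]]


def policy_value(n_iters=2000):
    """Iterative policy evaluation: V(s) = sum_a pi(a|s) [r + gamma * E V(s')]."""
    v = [0.0 for _ in STATES]
    for _ in range(n_iters):
        v = [sum(PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2]
                                                   for s2 in STATES))
                 for a in ACTIONS)
             for s in STATES]
    return v


def visitation(n_iters=2000):
    """Normalized discounted state-action visitation d(s, a) of PI."""
    # Fixed-point iteration on the state marginal:
    # mu(s) = (1 - gamma) d0(s) + gamma * sum_{s', a'} mu(s') pi(a'|s') P(s|s', a')
    mu = list(D0)
    for _ in range(n_iters):
        mu = [(1 - GAMMA) * D0[s]
              + GAMMA * sum(mu[sp] * PI[sp][a] * P[sp][a][s]
                            for sp in STATES for a in ACTIONS)
              for s in STATES]
    return [[mu[s] * PI[s][a] for a in ACTIONS] for s in STATES]


def flow_residual(d):
    """Largest violation of the linear Bellman flow constraints by d."""
    return max(
        abs(sum(d[s])
            - (1 - GAMMA) * D0[s]
            - GAMMA * sum(d[sp][a] * P[sp][a][s]
                          for sp in STATES for a in ACTIONS))
        for s in STATES)


v = policy_value()
d = visitation()

# Expected return computed two ways: from the value function, and as a
# linear function of the visitation distribution.
return_from_v = sum(D0[s] * v[s] for s in STATES)
return_from_d = sum(d[s][a] * R[s][a]
                    for s in STATES for a in ACTIONS) / (1 - GAMMA)

print("flow constraint residual:", flow_residual(d))
print("return via V:", return_from_v)
print("return via d:", return_from_d)
```

The two return estimates should agree up to floating-point error, and the flow residual should be near zero. Optimizing over all d that satisfy these constraints, rather than evaluating a single policy as done here, gives the constrained problem whose dual the abstract calls dual RL.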