It is well known that reinforcement learning (RL) can be formulated as a convex program with linear constraints. The dual of this formulation is unconstrained; we refer to it as dual RL, and it can leverage existing tools from convex optimization to improve the learning performance of RL agents. We show that several state-of-the-art deep RL algorithms (in online, offline, and imitation settings) can be viewed as dual RL approaches in a unified framework. This unification calls for the methods to be studied on common ground, so as to identify the components that actually contribute to their success. Our unification also reveals that prior off-policy imitation learning methods in the dual space rest on an unrealistic coverage assumption and are restricted to matching a particular f-divergence. We propose a new method, based on a simple modification to the dual framework, that enables imitation learning from arbitrary off-policy data and obtains near-expert performance.
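For readers who want the convex program in symbols, the following is a sketch of the standard regularized formulation over state-action visitation distributions and of the unconstrained dual it induces. The notation is an assumption introduced here and does not come from the abstract: $d$ is the normalized visitation distribution, $d_0$ the initial-state distribution, $p$ the transition kernel, $\gamma$ the discount factor, $d^O$ the visitation distribution of the off-policy data, $\alpha > 0$ a regularization weight, and $f^*_{+}$ the convex conjugate of $f$ taken over the nonnegative domain. The exact form used in the paper may differ.

```latex
% Primal: a concave objective maximized under linear (Bellman flow) constraints.
\begin{align*}
\max_{d \ge 0} \quad & \mathbb{E}_{(s,a) \sim d}\big[ r(s,a) \big] - \alpha\, D_f\big(d \,\|\, d^O\big) \\
\text{s.t.} \quad & \sum_{a} d(s,a) = (1-\gamma)\, d_0(s)
  + \gamma \sum_{s',a'} p(s \mid s',a')\, d(s',a') \qquad \forall s .
\end{align*}

% Dual: introduce one Lagrange multiplier V(s) per flow constraint and
% maximize the Lagrangian over d >= 0 in closed form. What remains is an
% unconstrained minimization over V.
\begin{equation*}
\min_{V} \quad (1-\gamma)\, \mathbb{E}_{s \sim d_0}\big[ V(s) \big]
  + \alpha\, \mathbb{E}_{(s,a) \sim d^O}\!\left[
      f^*_{+}\!\left( \frac{ r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\big[ V(s') \big] - V(s) }{\alpha} \right)
    \right].
\end{equation*}
```

In this sketch the dual objective depends on the data only through expectations under $d_0$ and $d^O$, which is what makes such formulations usable with off-policy samples.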