In deep reinforcement learning, policy optimization methods must contend with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, which can lead to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy, ensuring small and safe policy updates with off-policy data. The Bregman divergence is computed between the state distributions of the two policies, rather than only on the action probabilities, leading to a divergence-augmented formulation. Empirical experiments on Atari games show that in the data-scarce regime, where the reuse of off-policy data becomes necessary, our method achieves better performance than other state-of-the-art deep reinforcement learning algorithms.
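To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of an off-policy surrogate objective augmented with a divergence penalty. For simplicity it penalizes the KL divergence over action log-probabilities, which is the Bregman divergence generated by negative entropy; the paper's formulation instead uses a divergence over state distributions, which is harder to estimate from samples. The function name and the `coef` parameter are illustrative, not from the source.

```python
import numpy as np

def divergence_augmented_loss(logp_cur, logp_beh, advantages, coef=1.0):
    """Illustrative off-policy surrogate loss with a divergence penalty.

    logp_cur:   log-probs of the sampled actions under the current policy
    logp_beh:   log-probs under the behavior policy that generated the data
    advantages: advantage estimates for the sampled transitions
    coef:       penalty strength (hypothetical hyperparameter)
    """
    # Importance weights correct for the mismatch between the behavior
    # policy that collected the data and the current policy.
    ratio = np.exp(logp_cur - logp_beh)
    surrogate = np.mean(ratio * advantages)
    # Sample-based estimate of KL(behavior || current): the penalty keeps
    # the updated policy close to the policy that generated the data.
    kl = np.mean(logp_beh - logp_cur)
    # Return a loss to minimize: negative of the penalized objective.
    return -(surrogate - coef * kl)
```

When the current and behavior policies coincide, the importance weights are 1 and the penalty vanishes, recovering the ordinary on-policy surrogate; as the policies drift apart, the penalty grows and discourages large updates.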