Experience replay (ER), as used in (deep) reinforcement learning, is generally considered applicable only to off-policy algorithms. However, ER has occasionally been applied to on-policy algorithms, suggesting that off-policyness might be merely a sufficient condition for applying ER, rather than a necessary one. This paper reconsiders stricter "experience replayable conditions" (ERC) and proposes a way of modifying existing algorithms to satisfy them. To this end, it is postulated that the instability of policy improvement is the pivotal factor in ERC. From the viewpoint of metric learning, two instability factors are identified: i) repulsive forces from negative samples and ii) replays of inappropriate experiences. The corresponding stabilization tricks are then derived. Numerical simulations confirm that the proposed stabilization tricks make ER applicable to advantage actor-critic, an on-policy algorithm. Moreover, its learning performance becomes comparable to that of soft actor-critic, a state-of-the-art off-policy algorithm.
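For concreteness, the mechanism under discussion can be sketched as a minimal uniform experience replay buffer. This is a generic illustration, not the paper's implementation; the class and method names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Minimal uniform experience replay buffer (illustrative sketch)."""

    def __init__(self, capacity):
        # oldest transitions are evicted automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # store one transition tuple collected from environment interaction
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling of past transitions; replaying such off-policy data
        # is what destabilizes naive on-policy updates without extra tricks
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampled minibatches would then feed the actor-critic update; the paper's contribution lies in stabilizing that update so replayed (hence off-policy) transitions do not destabilize policy improvement.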