Compared to their on-policy counterparts, off-policy model-free deep reinforcement learning methods can improve data efficiency by reusing previously gathered data. However, off-policy learning becomes challenging as the discrepancy between the distribution induced by the agent's policy and the distribution of the collected data grows. Although well-studied importance sampling and off-policy policy gradient techniques have been proposed to compensate for this discrepancy, they usually require collecting long trajectories and introduce additional problems, such as vanishing or exploding gradients or the discarding of many useful experiences, which ultimately increases computational complexity. Moreover, their generalization to continuous action domains or to policies approximated by deterministic deep neural networks is strictly limited. To overcome these limitations, we introduce a novel policy similarity measure that mitigates the effects of this discrepancy in continuous control. Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks. Theoretical and empirical studies demonstrate that it can achieve "safe" off-policy learning and substantially improve on the state of the art, attaining higher returns in fewer steps than competing methods through an effective schedule of the learning rate in Q-learning and policy optimization.