Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify the objective function so that it can be computed from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis of conditions under which the on-off gap is reduced, together with empirical evidence of the shortfalls that arise when these conditions are not met.
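To make the two objectives concrete, the following is a minimal sketch on a toy tabular MDP, not the construction analyzed in this work. It assumes the common textbook forms: the on-policy objective as the expected value of the target policy under a start-state distribution, and the excursion objective as the expected value of the target policy under the stationary state distribution of the behavior policy. The MDP numbers, the function names, and the sign convention of the gap (on-policy minus excursion) are all illustrative assumptions.

```python
# Toy 2-state, 2-action MDP. Every number below is an illustrative assumption.
GAMMA = 0.9
# P[s][a][t]: probability of moving from state s to state t under action a.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
# R[s][a]: expected immediate reward for taking action a in state s.
R = [[0.0, 1.0],
     [0.5, 2.0]]
MU0 = [1.0, 0.0]  # assumed start-state distribution


def policy_values(pi, iters=2000):
    """Iterative policy evaluation of the discounted value function V^pi."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            sum(
                pi[s][a]
                * (R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(n)))
                for a in range(len(pi[s]))
            )
            for s in range(n)
        ]
    return v


def stationary_distribution(pi, iters=2000):
    """Power iteration on the state Markov chain induced by policy pi."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [
            sum(
                d[s] * pi[s][a] * P[s][a][t]
                for s in range(n)
                for a in range(len(pi[s]))
            )
            for t in range(n)
        ]
    return d


def on_policy_objective(pi, mu0=MU0):
    """Assumed on-policy objective: sum_s mu0(s) * V^pi(s)."""
    v = policy_values(pi)
    return sum(mu0[s] * v[s] for s in range(len(v)))


def excursion_objective(pi, behavior):
    """Assumed excursion objective: sum_s d_b(s) * V^pi(s)."""
    v = policy_values(pi)
    d_b = stationary_distribution(behavior)
    return sum(d_b[s] * v[s] for s in range(len(v)))


target = [[0.1, 0.9], [0.1, 0.9]]    # hypothetical target policy pi(a|s)
behavior = [[0.5, 0.5], [0.5, 0.5]]  # hypothetical uniform behavior policy

j_on = on_policy_objective(target)
j_exc = excursion_objective(target, behavior)
print("on-policy objective:", j_on)
print("excursion objective:", j_exc)
print("on-off gap (on minus excursion):", j_on - j_exc)
```

In this sketch the two objectives weight the same value function with different state distributions, so the gap vanishes when the start-state distribution coincides with the behavior policy's stationary distribution; whether that matches the conditions established in the analysis is not something this toy example can confirm.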