On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies, as is commonly done in on-policy algorithms, PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
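
The notion of sampling error described above can be illustrated with a small tabular sketch. This is not the PROPS algorithm itself. It assumes a single state with three discrete actions, and compares i.i.d. on-policy sampling against a hypothetical greedy sampler that always takes the action most under-sampled relative to the current policy. The function names, the policy `pi`, and the use of total variation distance as the error measure are assumptions made for illustration only.

```python
import random


def total_variation(counts, pi):
    """Total variation distance between empirical action frequencies and pi."""
    n = sum(counts)
    return 0.5 * sum(abs(c / n - p) for c, p in zip(counts, pi))


def on_policy_counts(pi, n, rng):
    """Draw n i.i.d. actions from pi and return the per-action counts."""
    counts = [0] * len(pi)
    for a in rng.choices(range(len(pi)), weights=pi, k=n):
        counts[a] += 1
    return counts


def adaptive_counts(pi, n):
    """Non-i.i.d. sampler (illustrative assumption, not PROPS).

    At step t, take the action whose count lags furthest behind its
    expected count pi[a] * t, i.e. the most under-sampled action.
    """
    counts = [0] * len(pi)
    for t in range(1, n + 1):
        deficits = [p * t - c for p, c in zip(pi, counts)]
        a = max(range(len(pi)), key=lambda i: deficits[i])
        counts[a] += 1
    return counts


pi = [0.5, 0.3, 0.2]      # hypothetical current policy over three actions
n = 20                    # finite sample budget
rng = random.Random(0)    # seeded for reproducibility

iid = on_policy_counts(pi, n, rng)
adaptive = adaptive_counts(pi, n)

print("on-policy counts:", iid, "TV:", total_variation(iid, pi))
print("adaptive counts: ", adaptive, "TV:", total_variation(adaptive, pi))
```

With these inputs the greedy sampler ends with counts `[10, 6, 4]`, which match the policy's proportions exactly, whereas the i.i.d. counts generally deviate from them at this sample size. The data are on-policy in distribution even though no action was sampled from `pi`, which is the distinction between on-policy data and on-policy sampling that the abstract draws.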
