Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by exploiting the fact that gradients lie in a low-dimensional and slowly changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate that such gradient subspaces exist despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.
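To make the notion of a gradient subspace concrete, the following is a minimal sketch of one way to measure how much of a gradient lies in a low-dimensional subspace: the fraction of its squared norm that survives projection onto an orthonormal basis. The measure itself, the toy quadratic loss with a diagonal Hessian, and the helper names `subspace_fraction` and `top_k_axes` are illustrative assumptions, not details taken from the abstract.

```python
def subspace_fraction(gradient, basis):
    """Fraction of the squared gradient norm captured by an orthonormal basis.

    `basis` is a list of orthonormal vectors spanning the candidate subspace.
    A value close to 1 means the gradient lies almost entirely in that subspace.
    """
    total = sum(g * g for g in gradient)
    if total == 0.0:
        return 0.0
    projected = sum(
        sum(b_i * g_i for b_i, g_i in zip(b, gradient)) ** 2 for b in basis
    )
    return projected / total


def top_k_axes(curvatures, k):
    """Top-k eigenvectors of a diagonal Hessian (assumption: toy setting).

    For a diagonal Hessian the eigenvectors are the coordinate axes, so we
    return the k axes with the largest curvature.
    """
    dim = len(curvatures)
    order = sorted(range(dim), key=lambda i: curvatures[i], reverse=True)[:k]
    return [[1.0 if j == i else 0.0 for j in range(dim)] for i in order]


# Hypothetical toy loss: 0.5 * sum(h_i * x_i^2) with two high-curvature directions.
curvatures = [100.0, 50.0] + [0.1] * 8
params = [1.0] * 10
gradient = [h * x for h, x in zip(curvatures, params)]  # gradient of the toy loss

basis = top_k_axes(curvatures, k=2)
fraction = subspace_fraction(gradient, basis)
print(f"fraction of gradient in 2-dimensional subspace: {fraction:.6f}")
```

In this toy setting a 2-dimensional subspace of a 10-dimensional parameter space captures nearly all of the gradient. For a neural network policy the basis would have to be estimated numerically, and the value would need to be tracked over training to judge how slowly the subspace changes.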