In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, the specific design choices differ significantly across algorithms. We provide a holistic overview of on-policy policy gradient algorithms to facilitate understanding of both their theoretical foundations and their practical implementations. In this overview, we include a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results, and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights on the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.
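In its simplest one-step form, the Policy Gradient Theorem says that the gradient of the expected reward equals the expected reward weighted by the score, $\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a)\, r(a)]$. The sketch below illustrates this with the score-function (REINFORCE) estimator for a single continuous action drawn from a Gaussian policy and checks it against the analytic gradient. It is not taken from the PolicyGradientsJax repository. The quadratic reward, the target value 2.0, and all function names are hypothetical choices made for illustration. It uses plain Python rather than JAX so that it runs with only the standard library.

```python
import random


def gaussian_log_prob_grad_mean(action: float, mean: float, std: float) -> float:
    """Score function: d/d(mean) of log N(action; mean, std^2)."""
    return (action - mean) / std**2


def reward(action: float, target: float = 2.0) -> float:
    """Hypothetical one-step reward, maximal when the action equals `target`."""
    return -((action - target) ** 2)


def reinforce_gradient(mean: float, std: float, num_samples: int,
                       seed: int = 0, use_baseline: bool = True) -> float:
    """Monte Carlo estimate of E[(r(a) - b) * d/d(mean) log pi(a)]."""
    rng = random.Random(seed)
    actions = [rng.gauss(mean, std) for _ in range(num_samples)]
    rewards = [reward(a) for a in actions]
    # A baseline that does not depend on the action leaves the expected
    # gradient unchanged (the score has zero mean) but usually lowers variance.
    # The batch-mean reward used here depends weakly on the samples, which adds
    # an O(1/N) bias that is negligible for large batches.
    baseline = sum(rewards) / num_samples if use_baseline else 0.0
    total = 0.0
    for a, r in zip(actions, rewards):
        total += (r - baseline) * gaussian_log_prob_grad_mean(a, mean, std)
    return total / num_samples


def analytic_gradient(mean: float, target: float = 2.0) -> float:
    """J(mean) = -((mean - target)^2 + std^2), so dJ/d(mean) = -2 (mean - target)."""
    return -2.0 * (mean - target)


if __name__ == "__main__":
    estimate = reinforce_gradient(mean=0.0, std=1.0, num_samples=100_000)
    print(f"estimate={estimate:.3f} analytic={analytic_gradient(0.0):.3f}")
```

With `mean=0.0` and `std=1.0`, the Monte Carlo estimate should land close to the analytic value of 4. Disabling the baseline leaves the estimate unbiased but noisier, which gives a small-scale view of the variance-reduction choices that distinguish practical policy gradient algorithms.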