In recent years, various powerful policy gradient algorithms have been proposed in deep reinforcement learning. While all these algorithms build on the Policy Gradient Theorem, their specific design choices differ significantly. We provide a holistic overview of on-policy policy gradient algorithms to facilitate understanding of both their theoretical foundations and their practical implementations. This overview includes a detailed proof of the continuous version of the Policy Gradient Theorem, convergence results, and a comprehensive discussion of practical algorithms. We compare the most prominent algorithms on continuous control environments and provide insights into the benefits of regularization. All code is available at https://github.com/Matt00n/PolicyGradientsJax.