We study reinforcement learning (RL) in continuous time and space, over an infinite horizon with a discounted objective, where the underlying dynamics are driven by a stochastic differential equation. Building on recent advances in the continuous approach to RL, we develop a notion of occupation time (specifically for a discounted objective) and show how it can be used to derive performance-difference and local-approximation formulas. We then extend these results to illustrate their application to policy gradient (PG) and trust region policy optimization/proximal policy optimization (TRPO/PPO) methods, which are familiar and powerful tools in the discrete RL setting but remain under-developed in continuous RL. Through numerical experiments, we demonstrate the effectiveness and advantages of our approach.
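To make the setting concrete, the following is a minimal sketch of how a discounted return and a discounted occupation time could be estimated by Monte Carlo for a controlled diffusion. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional dynamics dX = a dt + 0.5 dW, the quadratic reward, the feedback policies, the target region, the Euler-Maruyama discretization, and the function name `simulate_discounted`. The paper's own definition of occupation time may differ from the simple discounted time-in-region used here.

```python
import math
import random


def simulate_discounted(policy, x0=1.0, beta=1.0, dt=0.01, horizon=8.0,
                        n_paths=400, region=(-0.5, 0.5), seed=0):
    """Monte Carlo estimate of (discounted return, discounted occupation time).

    Hypothetical model (not from the paper):
      dynamics  dX_t = a_t dt + 0.5 dW_t, with a_t = policy(X_t)
      reward    r(x, a) = -(x^2 + 0.1 a^2)
    The infinite horizon is truncated at `horizon`; with discount rate `beta`
    the neglected tail is of order exp(-beta * horizon).
    """
    rng = random.Random(seed)
    n_steps = int(round(horizon / dt))
    total_ret = 0.0
    total_occ = 0.0
    for _ in range(n_paths):
        x = x0
        for k in range(n_steps):
            disc = math.exp(-beta * k * dt)  # discount factor e^{-beta t}
            a = policy(x)
            # Left-endpoint Riemann sum of the discounted running reward.
            total_ret += disc * -(x * x + 0.1 * a * a) * dt
            # Discounted time spent inside the target region.
            if region[0] <= x <= region[1]:
                total_occ += disc * dt
            # Euler-Maruyama step for the SDE.
            x += a * dt + 0.5 * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return total_ret / n_paths, total_occ / n_paths


if __name__ == "__main__":
    # Compare a stabilizing linear feedback with the zero-control policy.
    ret_fb, occ_fb = simulate_discounted(lambda x: -2.0 * x)
    ret_0, occ_0 = simulate_discounted(lambda x: 0.0)
    print("feedback:", ret_fb, occ_fb)
    print("zero    :", ret_0, occ_0)
```

In this toy model the linear feedback policy should yield both a higher discounted return and more discounted time near the origin than the zero-control policy. The total discounted occupation mass is bounded by roughly 1/beta, which is the normalization a discounted occupation measure typically carries.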