In this paper, a novel optimal control-based baseline function is presented for policy gradient methods in deep reinforcement learning (RL). The baseline is obtained by computing the value function of an optimal control problem that is constructed to be closely related to the RL task. In contrast to the traditional baseline, whose purpose is to reduce the variance of policy gradient estimates, our work uses the optimal control value function to introduce a novel aspect of the baseline's role: providing guided exploration during policy learning. This aspect has received little attention in prior work. We validate our baseline on robot learning tasks, demonstrating its effectiveness for guided exploration, particularly in sparse-reward environments.
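To make the role of a baseline concrete, the following is a minimal, hypothetical sketch of a REINFORCE-style gradient estimate with a generic state-dependent baseline b(s). The function name and toy numbers are illustrative only; in the paper's method the baseline would instead be the value function of the associated optimal control problem, which is not reproduced here.

```python
import numpy as np

# Illustrative only: a generic baseline b(s_t) subtracted from the return G_t.
# Subtracting any state-dependent baseline leaves the gradient estimate unbiased.

def policy_grad_estimate(grad_log_probs, returns, baselines):
    """Average of grad log pi(a_t|s_t) * (G_t - b(s_t)) over one trajectory.

    grad_log_probs: (T, d) array of score-function gradients
    returns:        (T,)   array of returns G_t
    baselines:      (T,)   array of baseline values b(s_t)
    """
    advantages = returns - baselines
    return (grad_log_probs * advantages[:, None]).mean(axis=0)

# Toy trajectory: 3 time steps, 2-dimensional policy parameters
g = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
G = np.array([3.0, 2.0, 1.0])
b = np.array([2.0, 2.0, 2.0])
print(policy_grad_estimate(g, G, b))  # -> [0.         -0.33333333]
```

A well-chosen baseline shapes which actions receive positive advantage: replacing the constant b above with an informative value function can steer updates toward promising states, which is the exploration-guiding role the abstract describes.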