Stateful policies play an important role in reinforcement learning, for example by handling partially observable environments, enhancing robustness, or imposing an inductive bias directly on the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which has significant drawbacks: training is slow because gradients are propagated sequentially, and gradients can vanish or explode. The gradient is often truncated to address these issues, which results in a biased policy update. We present a novel approach for training stateful policies that decomposes them into a stochastic internal state kernel and a stateless policy, which are jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g., humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.
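The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the policy samples the action and the next internal state jointly from one distribution conditioned on the current observation and internal state, so the internal state is treated like a sampled variable and the gradient is a per-step score-function (REINFORCE-style) term with no backpropagation through earlier time steps. The linear-Gaussian form, the fixed noise level `SIGMA`, the toy reward, and all function names are hypothetical choices made only for illustration.

```python
import random

SIGMA = 0.5  # fixed standard deviation of the policy noise (assumed for the sketch)


def mean(W, o, z):
    """Linear mean of the joint Gaussian over (action, next internal state)."""
    x = (o, z, 1.0)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sample_step(W, o, z, rng):
    """Draw (a, z_next) from the policy; z_next is sampled, not computed deterministically."""
    mu = mean(W, o, z)
    return rng.gauss(mu[0], SIGMA), rng.gauss(mu[1], SIGMA)


def score(W, o, z, a, z_next):
    """Gradient of log pi_W(a, z_next | o, z) w.r.t. W, with z treated as a plain input."""
    mu = mean(W, o, z)
    x = (o, z, 1.0)
    err = ((a - mu[0]) / SIGMA**2, (z_next - mu[1]) / SIGMA**2)
    return [[e * xi for xi in x] for e in err]


def episode_gradient(W, horizon, rng):
    """Single-episode score-function gradient estimate on a toy memory task."""
    z, prev_o = 0.0, 0.0
    scores, rewards = [], []
    for _ in range(horizon):
        o = rng.uniform(-1.0, 1.0)
        a, z_next = sample_step(W, o, z, rng)
        scores.append(score(W, o, z, a, z_next))
        rewards.append(-(a - prev_o) ** 2)  # toy reward: recall the previous observation
        z, prev_o = z_next, o
    grad = [[0.0] * 3 for _ in range(2)]
    ret = 0.0
    # Walk backwards so `ret` is the undiscounted reward-to-go at each step.
    for g, r in zip(reversed(scores), reversed(rewards)):
        ret += r
        for i in range(2):
            for j in range(3):
                grad[i][j] += g[i][j] * ret
    return grad, sum(rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.0] * 3 for _ in range(2)]
    grad, ret = episode_gradient(W, horizon=10, rng=rng)
    print(len(grad), len(grad[0]), ret <= 0.0)
```

Each per-step term depends only on quantities observed at that step, so the terms can be computed independently of one another. BPTT, by contrast, would differentiate through the whole chain of internal states.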