Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address the non-stationarity that arises in solving POMDPs, while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on the sample and iteration complexity required to achieve global optimality up to function approximation error. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining the limitations of RNN-based policy optimization for POMDPs.
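To make the setting concrete, the following is a minimal, illustrative sketch of the recurrent natural actor-critic idea described above: a GRU policy updated with a damped natural policy gradient step and a recurrent critic trained by TD learning, on a toy cue-recall POMDP where the reward depends on remembering the first observation. The environment, network sizes, step sizes, and the explicit damped-Fisher approximation of the natural gradient are assumptions made for illustration; this is not the exact algorithm or analysis of the paper.

```python
# Minimal sketch (illustrative assumptions throughout): recurrent natural
# actor-critic on a toy cue-recall POMDP. The cue is visible only at t=0,
# so the policy must carry it in the RNN hidden state to earn reward.
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

HORIZON, CUE_DIM, N_ACTIONS, HIDDEN = 5, 2, 2, 8


def rollout_cue_env(policy, batch=64):
    """Cue-recall POMDP: reward 1 iff the final action matches the initial cue."""
    cue = torch.randint(0, 2, (batch,))
    obs = torch.zeros(batch, HORIZON, CUE_DIM)
    obs[torch.arange(batch), 0, cue] = 1.0            # cue shown only at t=0
    with torch.no_grad():
        logits, _ = policy(obs)                       # (batch, T, n_actions)
    actions = Categorical(logits=logits).sample()     # (batch, T)
    rewards = torch.zeros(batch, HORIZON)
    rewards[:, -1] = (actions[:, -1] == cue).float()  # reward only at the end
    return obs, actions, rewards


class RecurrentNet(nn.Module):
    """GRU backbone used both for the actor (logits) and the critic (values)."""
    def __init__(self, out_dim):
        super().__init__()
        self.rnn = nn.GRU(CUE_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, out_dim)

    def forward(self, obs):
        h, _ = self.rnn(obs)
        return self.head(h), h


actor, critic = RecurrentNet(N_ACTIONS), RecurrentNet(1)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-2)
flat = lambda grads: torch.cat([g.reshape(-1) for g in grads])

for it in range(200):
    obs, actions, rewards = rollout_cue_env(actor)
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [1]), 1), [1])

    # --- TD(0)-style semi-gradient update of the recurrent critic ---
    values, _ = critic(obs)
    values = values.squeeze(-1)
    next_values = torch.cat([values[:, 1:], torch.zeros_like(values[:, :1])], 1)
    targets = (rewards + next_values).detach()
    critic_loss = ((values - targets) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # --- Natural policy gradient step for the recurrent actor ---
    logits, _ = actor(obs)
    logp = Categorical(logits=logits).log_prob(actions)   # (batch, T)
    adv = returns - values.detach()                       # advantage estimate
    pg_loss = -(logp * adv).mean()
    g = flat(torch.autograd.grad(pg_loss, list(actor.parameters()),
                                 retain_graph=True))

    # Empirical Fisher from a subsample of per-trajectory score functions,
    # with damping; solving F x = g gives the natural-gradient direction.
    scores = []
    for i in range(0, obs.shape[0], 8):
        s = flat(torch.autograd.grad(logp[i].sum(), list(actor.parameters()),
                                     retain_graph=True))
        scores.append(s)
    S = torch.stack(scores)
    fisher = S.T @ S / S.shape[0] + 1e-3 * torch.eye(S.shape[1])
    step = torch.linalg.solve(fisher, g)

    with torch.no_grad():
        offset = 0
        for p in actor.parameters():
            n = p.numel()
            p -= 0.1 * step[offset:offset + n].view_as(p)
            offset += n

    if it % 50 == 0:
        print(f"iter {it}: mean return {rewards.sum(1).mean().item():.2f}")
```

In this sketch the explicit Fisher matrix is only tractable because the recurrent networks are tiny; it stands in for the sample-based NPG updates that a practical or theoretically analyzed method would use, and the cue-recall task illustrates why a memoryless (non-recurrent) policy cannot be optimal under partial observability.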