In this paper, we study a natural policy gradient method based on recurrent neural networks (RNNs) for partially observable Markov decision processes, in which RNNs are used for both policy parameterization and policy evaluation to address the curse of dimensionality in non-Markovian reinforcement learning. We present finite-time and finite-width analyses for both the critic (recurrent temporal difference learning) and the corresponding recurrent natural policy gradient method in the near-initialization regime. Our analysis demonstrates the efficiency of RNNs for problems with short-term memory, with explicit bounds on the required network widths and sample complexity, and points out the challenges that arise in the presence of long-term dependencies.