Policy gradient methods equipped with deep neural networks have achieved great success in solving high-dimensional reinforcement learning (RL) problems. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNNs). Motivated by the empirical observation that many high-dimensional environments, such as those taking images as states, have state spaces with low-dimensional structure, we model the state space as a $d$-dimensional manifold embedded in $D$-dimensional Euclidean space, with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the networks from previous iterations is inherited. As a result, with properly chosen network size and hyperparameters, NPMD finds an $\epsilon$-optimal policy with $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment. Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape the curse of dimensionality, providing an explanation for the efficacy of deep policy gradient algorithms.
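To give a feel for the rate $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$, the following minimal sketch evaluates its exponent and leading-order magnitude. The function names and the numerical values of $d$, $D$, $\alpha$, and $\epsilon$ are hypothetical choices made for illustration. The ambient-dimension comparison simply substitutes $D$ for $d$ in the same expression; that substitution is an assumption used as a stand-in for a rate that scales with $D$, not a bound taken from this work or from prior work. Constants and logarithmic factors hidden by $\widetilde{O}$ are ignored.

```python
def sample_exponent(dim: float, alpha: float) -> float:
    """Return p = dim/alpha + 2, so that the bound reads eps**(-p)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    return dim / alpha + 2


def leading_order_samples(eps: float, dim: float, alpha: float) -> float:
    """Evaluate eps**-(dim/alpha + 2), dropping constants and log factors."""
    return eps ** (-sample_exponent(dim, alpha))


# Hypothetical values, for illustration only.
eps = 0.1
alpha = 1.0
intrinsic_d = 4    # assumed intrinsic dimension of the state manifold
ambient_D = 64     # assumed ambient (e.g. pixel-space) dimension

# Exponent driven by the intrinsic dimension d, as in the stated bound.
print("exponent with d:", sample_exponent(intrinsic_d, alpha))

# Same formula with D substituted for d (an assumption, not a quoted bound).
print("exponent with D:", sample_exponent(ambient_D, alpha))

# Leading-order sample counts under each exponent.
print("samples with d:", leading_order_samples(eps, intrinsic_d, alpha))
print("samples with D:", leading_order_samples(eps, ambient_D, alpha))
```

Under these assumed values the exponent is governed by $d/\alpha$, so a smaller smoothness parameter $\alpha$ or a larger intrinsic dimension $d$ increases the sample requirement, while the ambient dimension $D$ does not enter the exponent at all.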