Policy-based algorithms equipped with deep neural networks have achieved great success in solving high-dimensional policy optimization problems in reinforcement learning. However, current analyses cannot explain why they are resistant to the curse of dimensionality. In this work, we study the sample complexity of the neural policy mirror descent (NPMD) algorithm with convolutional neural networks (CNNs) as function approximators. Motivated by the empirical observation that many high-dimensional environments, such as those taking images as states, have state spaces with low-dimensional structure, we model the state space as a $d$-dimensional manifold embedded in $D$-dimensional Euclidean space, with intrinsic dimension $d\ll D$. We show that in each iteration of NPMD, both the value function and the policy can be well approximated by CNNs. The approximation errors are controlled by the size of the networks, and the smoothness of the networks from the previous iteration can be inherited. As a result, by properly choosing the network size and hyperparameters, NPMD can find an $\epsilon$-optimal policy with $\widetilde{O}(\epsilon^{-\frac{d}{\alpha}-2})$ samples in expectation, where $\alpha\in(0,1]$ indicates the smoothness of the environment. Compared to previous work, our result shows that NPMD can leverage the low-dimensional structure of the state space to escape the curse of dimensionality, providing an explanation for the efficacy of deep policy-based algorithms.
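To make the stated rate concrete, the following minimal sketch evaluates the exponent $\frac{d}{\alpha}+2$ and the corresponding scale $\epsilon^{-\frac{d}{\alpha}-2}$. It is illustrative only: the function names are hypothetical, and the constants and logarithmic factors hidden by $\widetilde{O}(\cdot)$ are ignored, so the returned values indicate growth rates rather than actual sample counts.

```python
def npmd_sample_exponent(d: float, alpha: float) -> float:
    """Exponent p in the stated bound O~(eps^(-p)), with p = d / alpha + 2.

    d     : intrinsic dimension of the state manifold (assumed positive)
    alpha : smoothness parameter of the environment, in (0, 1]
    """
    if d <= 0:
        raise ValueError("intrinsic dimension d must be positive")
    if not (0 < alpha <= 1):
        raise ValueError("alpha must lie in (0, 1]")
    return d / alpha + 2


def npmd_sample_scale(eps: float, d: float, alpha: float) -> float:
    """Scale eps^(-(d / alpha + 2)) of the bound.

    Constants and log factors hidden by O~ are dropped (an assumption of
    this sketch), so this is a growth rate, not a sample count.
    """
    if not (0 < eps < 1):
        raise ValueError("eps must lie in (0, 1)")
    return eps ** (-npmd_sample_exponent(d, alpha))


if __name__ == "__main__":
    # The ambient dimension D does not appear anywhere in the rate.
    for d, alpha in [(4, 1.0), (4, 0.5), (8, 1.0)]:
        p = npmd_sample_exponent(d, alpha)
        print(f"d={d}, alpha={alpha}: exponent={p}, "
              f"scale at eps=0.1 is {npmd_sample_scale(0.1, d, alpha):.3g}")
```

For instance, with $d=4$ and $\alpha=1$ the exponent is $6$, whatever the ambient dimension $D$ is; halving the smoothness to $\alpha=0.5$ raises it to $10$.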