Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
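To make the role of the mirror map concrete, the following is a minimal tabular sketch, not the parameterized algorithm proposed in this work. It performs one mirror-descent (ascent) step on the action distribution at a single state under two standard mirror maps. With the negative entropy, the step is a multiplicative update, so starting from a uniform policy it yields a softmax of the scaled action values. With the squared Euclidean norm, the step is a Euclidean projection onto the probability simplex, which can assign exactly zero probability to some actions. The function names, the action values, and the step size are all illustrative assumptions.

```python
import math


def md_step_negative_entropy(policy, q_values, step_size):
    """One mirror-ascent step with the negative-entropy mirror map.

    Closed form: new_policy(a) is proportional to
    policy(a) * exp(step_size * Q(a)).
    """
    # Subtracting the max only rescales the weights; it avoids overflow.
    q_max = max(q_values)
    weights = [p * math.exp(step_size * (q - q_max))
               for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def project_onto_simplex(vector):
    """Euclidean projection onto the probability simplex (sort-based)."""
    sorted_desc = sorted(vector, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        # The last k for which this holds determines the shift.
        if value - candidate > 0:
            threshold = candidate
    return [max(x - threshold, 0.0) for x in vector]


def md_step_euclidean(policy, q_values, step_size):
    """One mirror-ascent step with the squared Euclidean norm as mirror map.

    Closed form: projected gradient ascent on the simplex.
    """
    shifted = [p + step_size * q for p, q in zip(policy, q_values)]
    return project_onto_simplex(shifted)


def softmax(logits):
    """Numerically stable softmax, used only to check the first update."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    uniform = [1.0 / 3.0] * 3
    q = [1.0, 0.0, -1.0]   # made-up action values for one state
    eta = 0.5              # made-up step size
    print("negative entropy:", md_step_negative_entropy(uniform, q, eta))
    print("euclidean:       ", md_step_euclidean(uniform, q, eta))
```

In this toy setting, the two mirror maps produce visibly different policy classes from the same action values: the entropic update keeps every action at positive probability, while the Euclidean update can place an action on the boundary of the simplex.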