Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
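To illustrate the remark that the induced policy class recovers softmax, here is a minimal sketch of the well-known tabular special case, not the general-parameterization algorithm of this work. It assumes a single state, a fixed vector of action values, and the negative-entropy mirror map, under which a mirror-descent step on the simplex reduces to a multiplicative (softmax) update. The names `md_step_neg_entropy`, `q`, and `eta` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def md_step_neg_entropy(pi, q, eta):
    """One mirror-descent step on the probability simplex.

    With the negative-entropy mirror map, the update
        argmax_p  eta * <q, p> - KL(p || pi)
    has the closed form  p(a) proportional to pi(a) * exp(eta * q(a)).
    """
    return softmax(np.log(pi) + eta * q)


# Hypothetical action values for a single state (assumption, not from the paper).
q = np.array([1.0, 0.5, -0.2])
eta = 0.1            # step size (assumed)
num_steps = 5

pi = np.full(q.shape, 1.0 / q.size)   # start from the uniform policy
for _ in range(num_steps):
    pi = md_step_neg_entropy(pi, q, eta)

# Starting from uniform with q held fixed, the iterate after k steps
# is exactly a softmax policy with logits k * eta * q.
closed_form = softmax(num_steps * eta * q)
```

In this sketch `pi` and `closed_form` coincide up to floating-point error, which is the sense in which the negative-entropy mirror map yields the softmax class. Other mirror maps would give different closed forms, and in a full algorithm the action values would be re-estimated at every iteration rather than held fixed.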