Modern policy optimization methods in applied reinforcement learning are often inspired by the trust region policy optimization algorithm, which can be interpreted as a particular instance of policy mirror descent. While theoretical guarantees have been established for this framework, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, such as tabular softmax, log-linear, and neural policies, and generates new ones depending on the choice of the mirror map. For a general mirror map and parametrization function, we establish quasi-monotonicity of the updates in the value function and global linear convergence rates, and we bound the total variation of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.
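As a point of reference for the tabular softmax case mentioned above, the following is a minimal sketch of classical tabular policy mirror descent with the negative-entropy mirror map, for which the update takes the familiar multiplicative form pi_new(a|s) ∝ pi(a|s) exp(eta * Q(s, a)). It is not the parametrized framework introduced in this work. The random MDP, the step size `eta`, the discount `gamma`, and all function names (`evaluate_policy`, `pmd_step`, `run_pmd`) are assumptions chosen for illustration, and exact policy evaluation is assumed to be available.

```python
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact Q^pi and V^pi for a tabular MDP (illustrative helper).

    P: (S, A, S) transition probabilities, r: (S, A) rewards, pi: (S, A) policy.
    """
    S, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                 # shape (S, A)
    return Q, V

def pmd_step(pi, Q, eta):
    """One mirror descent step with the negative-entropy mirror map.

    The Bregman divergence is then the KL divergence, and the update is
    pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)).
    """
    logits = np.log(np.maximum(pi, 1e-300)) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def run_pmd(P, r, gamma=0.9, eta=1.0, iters=50):
    """Run tabular policy mirror descent from the uniform policy."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)
    values = []
    for _ in range(iters):
        Q, V = evaluate_policy(P, r, pi, gamma)
        values.append(V.copy())
        pi = pmd_step(pi, Q, eta)
    return pi, np.array(values)

# Hypothetical random MDP, used only to exercise the sketch.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

pi, values = run_pmd(P, r)
```

In this exact tabular setting the value function is known to improve monotonically at every state, so each row of `values` should dominate the previous one up to floating-point error. Under a general parametrization the update can only be approximated, which is why the guarantee discussed above is quasi-monotonicity rather than strict monotonicity.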