Modern policy optimization methods in applied reinforcement learning, such as Trust Region Policy Optimization and Policy Mirror Descent, are often based on the policy gradient framework. While theoretical guarantees have been established for this class of algorithms, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones, depending on the choice of the mirror map. For a general mirror map and parametrization class, we establish the quasi-monotonicity of the updates in terms of the value function, prove global linear convergence rates, and bound the total expected Bregman divergence of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.
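To make the role of the mirror map concrete, the following is a minimal tabular sketch, not the parametrized algorithm of this work. It assumes a single state with three actions and a fixed, hypothetical action-value vector `q_values`; the function names and the step size are illustrative choices as well. With the negative-entropy mirror map, the mirror descent step has the well-known multiplicative closed form, so iterating from the uniform policy yields a softmax policy. With the squared Euclidean mirror map, the step becomes a Euclidean projection onto the simplex, which can produce a different, sparse policy.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def md_step_neg_entropy(policy, q_values, step_size):
    """Mirror descent step with the negative-entropy mirror map.
    Closed form: new policy is proportional to policy * exp(step_size * q)."""
    logits = [math.log(p) + step_size * q for p, q in zip(policy, q_values)]
    return softmax(logits)

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold of the largest valid index
    return [max(x - theta, 0.0) for x in v]

def md_step_euclidean(policy, q_values, step_size):
    """Mirror descent step with the squared Euclidean mirror map,
    i.e. a projected ascent step on the simplex."""
    shifted = [p + step_size * q for p, q in zip(policy, q_values)]
    return project_simplex(shifted)

# Hypothetical setup: one state, three actions, fixed action values.
q_values = [1.0, 0.5, 0.0]
step_size = 0.5
num_steps = 3

kl_policy = [1.0 / 3] * 3
euclid_policy = [1.0 / 3] * 3
for _ in range(num_steps):
    kl_policy = md_step_neg_entropy(kl_policy, q_values, step_size)
    euclid_policy = md_step_euclidean(euclid_policy, q_values, step_size)

# From a uniform start, the negative-entropy iterates are a softmax
# of the accumulated scaled action values.
softmax_policy = softmax([num_steps * step_size * q for q in q_values])

print("negative entropy:", kl_policy)
print("softmax form:    ", softmax_policy)
print("Euclidean:       ", euclid_policy)
```

In this toy setting the negative-entropy policy keeps every action probability strictly positive, whereas the Euclidean policy assigns exactly zero probability to the worst action, which illustrates how the choice of mirror map changes the induced policy class.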