Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite.