The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic's update should be nested to learn a best response to the actor's policy. Second, the actor should update according to a hypergradient that takes changes in the critic's behavior into account. Computing this hypergradient involves finding an inverse Hessian-vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which nests the critic's updates to respect the bilevel structure of BLO and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies the necessary conditions for) a local strong Stackelberg equilibrium in polynomial time with high probability, assuming a linear parametrization of the critic's objective. Empirically, we demonstrate that BLPO performs on par with or better than PPO on a variety of discrete and continuous control tasks.
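As a rough illustration of the kind of computation the abstract describes, the following sketch approximates the inverse Hessian-vector product needed for a hypergradient using a randomized Nyström approximation. This is a minimal, generic construction (following the standard stabilized Nyström sketch with a Woodbury-style inverse), not the paper's exact BLPO algorithm; the function names, the ridge term `rho`, and the rank parameter are all illustrative assumptions. The Hessian is accessed only through Hessian-vector products, as would be the case for a neural critic.

```python
import numpy as np

def nystrom_inv_hvp(hvp, dim, v, rank=20, rho=1e-3, seed=0):
    """Approximate (H + rho*I)^{-1} v for a PSD matrix H of size dim x dim,
    accessed only through the Hessian-vector product callable `hvp`.
    Illustrative sketch: rank-`rank` randomized Nystrom approximation,
    then a Woodbury-style closed-form inverse of the low-rank factor."""
    rng = np.random.default_rng(seed)
    # Orthonormal Gaussian test matrix and sketch Y = H @ Omega.
    Omega, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    Y = np.column_stack([hvp(Omega[:, j]) for j in range(rank)])
    # Tiny shift for numerical stability of the Cholesky factorization.
    nu = np.sqrt(dim) * np.finfo(Y.dtype).eps * np.linalg.norm(Y)
    Y_nu = Y + nu * Omega
    C = np.linalg.cholesky(Omega.T @ Y_nu)
    B = np.linalg.solve(C, Y_nu.T).T          # H ~= B @ B.T - nu*I
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    lam = np.maximum(s**2 - nu, 0.0)          # approximate eigenvalues of H
    # (U diag(lam) U^T + rho I)^{-1} v
    #   = U diag(1/(lam+rho)) U^T v + (v - U U^T v) / rho
    Utv = U.T @ v
    return U @ (Utv / (lam + rho)) + (v - U @ Utv) / rho
```

In a hypergradient computation, `hvp` would be the Hessian-vector product of the critic's (lower-level) objective with respect to the critic's parameters; the low-rank structure lets the inverse be applied in closed form, avoiding the unstable iterative solves the abstract alludes to.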