Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and any value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that can be poorly correlated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of policy and critic parameterization. Instantiating the generic algorithm results in an actor that maximizes a sequence of surrogate functions (similar to TRPO and PPO) and a critic that minimizes a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems.
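The mismatch between critic accuracy and actor reward can be illustrated with a toy sketch. The 3-armed bandit below, its rewards, its features, and the one-parameter linear critic are all hypothetical choices made for illustration; this is not the bandit example or the critic objective from the paper. It only shows that, under limited critic capacity, the weight with the lowest squared error can rank the arms worse than a weight with a higher squared error.

```python
# Hypothetical 3-armed bandit; every number here is an illustrative assumption.
rewards = [1.0, 0.5, 0.8]      # true expected reward per arm; arm 0 is best
features = [1.0, 0.0, -2.0]    # one scalar feature per arm


def critic(w):
    """Linear critic with a single weight: q_hat(a) = w * features[a]."""
    return [w * x for x in features]


def squared_error(w):
    """Sum of squared differences between critic estimates and true rewards."""
    return sum((q - r) ** 2 for q, r in zip(critic(w), rewards))


def greedy_arm(w):
    """Arm a greedy actor would pick if it trusted the critic."""
    q = critic(w)
    return max(range(len(q)), key=lambda a: q[a])


# Closed-form least-squares weight for a one-parameter linear model.
w_ls = sum(x * r for x, r in zip(features, rewards)) / sum(x * x for x in features)

# A hand-picked positive weight, not produced by any training objective.
w_alt = 0.1

best_arm = max(range(len(rewards)), key=lambda a: rewards[a])

for name, w in [("least squares", w_ls), ("hand-picked", w_alt)]:
    print(name, "squared error:", squared_error(w),
          "greedy arm:", greedy_arm(w), "best arm:", best_arm)
```

In this sketch the least-squares weight is negative, so the greedy actor prefers arm 2, while any positive weight recovers the truly best arm 0 despite a larger squared error. A decision-aware critic objective is meant to account for this kind of effect on the actor's decisions.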