On-policy algorithms are generally considered stable but sample-intensive. Off-policy algorithms, which reuse past experience, are considered sample-efficient but are often unstable. Can we design an algorithm that exploits off-policy data while retaining the stable learning of the on-policy approach? In this paper, we present an actor-critic learning framework that borrows the distributional perspective for evaluation and combines the two sources of data for policy improvement, which enables fast learning and applies to a wide class of algorithms. At its core are variance reduction mechanisms, proposed here for the first time, such as a unified advantage estimator (UAE), which extends the generalized advantage estimator (GAE) to any state-dependent baseline, and a learned baseline that is able to stabilize the policy gradient. These mechanisms not only act as a bridge to the action-value function but also distill the advantageous learning signal. Finally, we show empirically that our method improves sample efficiency and interpolates well between different levels. As an organic whole, this mixture offers further inspiration for algorithm design.
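For readers unfamiliar with GAE, which the abstract takes as its starting point, the following is a minimal sketch of the standard estimator, not of the UAE proposed here (the abstract does not give its formula). The function name `gae_advantages` and its parameters are illustrative assumptions. The baseline is passed as a plain sequence of per-state values, only to make concrete what a state-dependent baseline looks like in the computation.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    baselines: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Standard GAE over one trajectory (illustrative, not the paper's UAE).

    rewards:   r_0 .. r_{T-1}
    baselines: b(s_0) .. b(s_T); the last entry is the bootstrap value
               (use 0.0 if the trajectory ended in a terminal state).
    """
    if len(baselines) != len(rewards) + 1:
        raise ValueError("baselines must have one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum from the next one.
    for t in reversed(range(len(rewards))):
        # One-step TD residual with respect to the baseline.
        delta = rewards[t] + gamma * baselines[t + 1] - baselines[t]
        # Exponentially weighted sum of residuals with weight gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` this reduces to the one-step TD residual, and with `lam=1` to the discounted return minus the baseline; intermediate values trade bias against variance.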