On-policy algorithms are known for their stability but often require a large number of samples. Off-policy algorithms, which reuse past experience, are more sample-efficient but tend to be unstable. Can we develop an algorithm that exploits off-policy data while keeping learning stable? In this paper, we introduce an actor-critic learning framework that combines on-policy and off-policy data for both evaluation and control, enabling rapid learning and flexible integration with existing on-policy algorithms. The framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, which improve the efficacy of both on- and off-policy learning. Our empirical results show substantial improvements in the sample efficiency of on-policy algorithms, effectively bridging the gap to off-policy approaches, and demonstrate the promise of our approach as a novel learning paradigm.
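The abstract does not define the unified advantage estimator (UAE) or the residual baseline, so neither can be reproduced from this text. As a point of reference only, the sketch below shows generalized advantage estimation (GAE), a standard on-policy advantage estimator that trades bias against variance through a parameter `lam`. It is not the paper's UAE; the function name `gae_advantages`, its parameters, and the default values are assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    NOTE: illustrative stand-in, not the UAE from the abstract.

    rewards: list of T per-step rewards.
    values:  list of T + 1 critic estimates; the last entry is the
             bootstrap value of the state after the final step
             (use 0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error; values[t] acts as the baseline.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # With a zero critic and gamma = lam = 1, the advantages are reward-to-go.
    print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

Setting `lam` close to 0 relies more on the critic (lower variance, more bias), while `lam` close to 1 relies more on observed returns (higher variance, less bias). This estimator assumes the trajectory was collected by the current policy; extending advantage estimation to off-policy data is the part the abstract attributes to UAE and is not shown here.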