The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way that could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as a causal effect and shares similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that models the advantage function and estimates it directly from on-policy data while simultaneously minimizing the variance of the return, without requiring the (action-)value function. We also relate our method to temporal-difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adopted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
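For orientation, the GAE baseline named in the abstract can be sketched in a few lines. This is not DAE, whose estimator is not specified in the abstract; it is only the standard GAE recursion, which builds advantage estimates from one-step temporal-difference errors of a learned state-value function. The function name `gae_advantages`, the argument layout, and the default values of `gamma` and `lam` are illustrative assumptions, not taken from the paper.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]; use V(s_T) = 0 if s_T is terminal.
    Returns a list of T advantage estimates.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` each estimate reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline; intermediate values trade bias against variance.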