To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring how much each action contributes to obtaining subsequent rewards, which amounts to answering a counterfactual query: "Would the agent still have reached this reward if it had taken another action?" We show that measuring contributions with respect to rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions with respect to rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. Using dynamic programming, we compute ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
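The difference between state-based and reward-based contributions can be illustrated with a toy calculation. The sketch below is not taken from the abstract: it assumes a contribution coefficient of the ratio form p(u | s, a) / p(u | s) - 1 for an outcome u, and the function name `contribution_coefficients` and the two-action setup are hypothetical. Each action leads deterministically to its own state, and both states yield the same reward, so the action has no real influence on the reward.

```python
def contribution_coefficients(policy, outcome_given_action):
    """Assumed coefficient form: p(u | s, a) / p(u | s) - 1 for each action a.

    policy: list of pi(a | s) for each action a.
    outcome_given_action: list of p(u | s, a) for one fixed outcome u.
    """
    # Marginal probability of the outcome under the policy: p(u | s).
    marginal = sum(p_a * p_u for p_a, p_u in zip(policy, outcome_given_action))
    return [p_u / marginal - 1.0 for p_u in outcome_given_action]


# Hypothetical setup: two actions, taken with probability 0.25 and 0.75.
policy = [0.25, 0.75]

# Outcome = "the state reached by action 0". Only action 0 reaches it,
# so the coefficients single out the taken action, as REINFORCE would.
state_based = contribution_coefficients(policy, [1.0, 0.0])

# Outcome = "reward of 1 received". Both actions obtain it with certainty,
# so no action receives credit.
reward_based = contribution_coefficients(policy, [1.0, 1.0])

print(state_based)   # [3.0, -1.0]
print(reward_based)  # [0.0, 0.0]
```

Under these assumptions, the state-based coefficients are nonzero even though the reward does not depend on the action, while the reward-based coefficients vanish.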