To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring how much each action contributes to obtaining subsequent rewards, quantified through a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?' We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
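The counterfactual query in the abstract can be made concrete with a small illustrative sketch. This is not the authors' implementation: it assumes a single decision step, and the names `contribution_coefficients`, `policy`, and `reach_prob` are hypothetical. It also assumes that a contribution can be scored as the ratio between the probability of reaching the reward after a given action and the probability of reaching it under the policy on average, minus one.

```python
def contribution_coefficients(policy, reach_prob):
    """Score how much each action contributes to reaching a reward.

    policy[a]     : probability that the policy picks action a (assumed input)
    reach_prob[a] : probability of reaching the reward after action a (assumed input)

    Returns one coefficient per action: reach_prob[a] / marginal - 1, where
    marginal is the reach probability averaged over the policy. A coefficient
    of 0 means the agent would have reached the reward just as often with
    another action, so that action earns no credit.
    """
    marginal = sum(p * q for p, q in zip(policy, reach_prob))
    if marginal == 0.0:
        # The reward is never reached under this policy, so no action gets credit.
        return [0.0] * len(policy)
    return [q / marginal - 1.0 for q in reach_prob]


# Reward reached equally often after either action: neither action contributes.
no_influence = contribution_coefficients([0.5, 0.5], [0.8, 0.8])

# Reward reached only after action 0: positive credit for it, negative for action 1.
full_influence = contribution_coefficients([0.5, 0.5], [1.0, 0.0])
```

In the first call both coefficients are zero, which is the case where the answer to "would the agent still have reached this reward?" is yes. Under these assumptions the coefficients always average to zero under the policy, so an action is only credited relative to the alternatives.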