Preference-based Reinforcement Learning (PbRL) removes the need to hand-specify a reward function by instead learning a reward from preference feedback over policy behaviors. Current approaches to PbRL do not address the credit-assignment problem inherent in determining which parts of a behavior most contributed to a preference, which results in data-intensive approaches and subpar reward functions. We address these limitations by introducing a credit-assignment strategy, Hindsight PRIOR, that uses a world model to approximate state importance within a trajectory and then guides rewards to be proportional to state importance through an auxiliary predicted-return redistribution objective. Incorporating state importance into reward learning improves the speed of policy learning, overall policy performance, and reward recovery on both locomotion and manipulation tasks. For example, Hindsight PRIOR recovers significantly (p < 0.05) more reward on average on MetaWorld (20%) and DMC (15%). The performance gains and our ablations demonstrate the benefits that even a simple credit-assignment strategy can have on reward learning, and that state importance in forward-dynamics prediction is a strong proxy for a state's contribution to a preference decision. The code repository can be found at https://github.com/apple/ml-rlhf-hindsight-prior.
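The core idea above can be illustrated with a minimal sketch: a trajectory's predicted return is redistributed across states in proportion to their (world-model-derived) importance, and an auxiliary mean-squared-error term guides the learned per-state rewards toward those shares. The function names, NumPy formulation, and example numbers below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def redistribution_targets(importance, predicted_return):
    # Split the trajectory's predicted return across states
    # in proportion to each state's importance weight.
    weights = importance / importance.sum()
    return weights * predicted_return

def aux_redistribution_loss(rewards, importance, predicted_return):
    # Auxiliary MSE guiding learned per-state rewards toward
    # importance-proportional shares of the predicted return.
    targets = redistribution_targets(importance, predicted_return)
    return float(np.mean((rewards - targets) ** 2))

# Hypothetical importance scores for a 4-state trajectory
# (in Hindsight PRIOR these come from a world model).
importance = np.array([0.1, 0.5, 0.3, 0.1])
rewards = np.array([0.2, 0.5, 0.2, 0.1])  # current learned rewards
loss = aux_redistribution_loss(rewards, importance, predicted_return=1.0)
```

In training, this loss would be added to the standard preference-based reward-learning objective, so that states the world model deems important receive a proportionally larger share of the reward.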