Preference-based reinforcement learning has shown considerable promise in using binary human feedback on queried trajectory pairs to recover the underlying reward model of the human in the loop (HiL). Whereas prior work has focused on making better use of the queries posed to the human, we make two observations about the unlabeled trajectories collected by the agent and propose two corresponding loss functions: one ensures that unlabeled trajectories participate in the reward learning process, and the other structures the embedding space of the reward model so that it reflects the structure of the state space with respect to action distances. We validate the proposed method on one locomotion domain and one robotic manipulation task and compare it with the state-of-the-art baseline PEBBLE. We further present an ablation of the proposed loss components across both domains and find that each loss component on its own outperforms the baseline, and that the combination of the two yields substantially better reward recovery and human-feedback sample efficiency.
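For readers unfamiliar with how binary feedback on a trajectory pair becomes a training signal, the following is a minimal sketch of the Bradley-Terry cross-entropy loss commonly used in preference-based RL, including by baselines such as PEBBLE. It is not the pair of losses proposed in this work, whose form is not given here. The function names, the per-step reward arrays, and the label convention (1.0 when the first segment is preferred) are assumptions made for illustration.

```python
import numpy as np


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment a is preferred over segment b.

    rewards_a, rewards_b: per-step predicted rewards for each segment
    (hypothetical inputs; in practice they come from the learned reward model).
    """
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    # Sigmoid of the return difference, written via logaddexp to avoid overflow.
    return float(np.exp(-np.logaddexp(0.0, -diff)))


def preference_loss(rewards_a, rewards_b, label):
    """Binary cross-entropy between the human label and the model's preference.

    label: 1.0 if the human preferred segment a, 0.0 if segment b (assumed convention).
    """
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    # -log P(a > b) = log(1 + exp(-diff)),  -log P(b > a) = log(1 + exp(diff))
    return float(label * np.logaddexp(0.0, -diff)
                 + (1.0 - label) * np.logaddexp(0.0, diff))
```

Minimizing this loss over labeled pairs pushes the summed predicted reward of the preferred segment above that of the other. Only labeled pairs contribute to it, which is the gap the two proposed losses on unlabeled trajectories aim to address.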