Reinforcement learning often suffers from poor reward specification and from reward hacking, even in relatively simple domains. Preference-based reinforcement learning addresses this by learning a reward model from binary feedback: a human in the loop is queried with pairs of trajectories and indicates which of the two they prefer as agent behavior. In this work, we present a state augmentation technique that makes the agent's reward model more robust by having it follow an invariance consistency. The technique significantly improves performance, measured by reward recovery and by the subsequent return computed using the learned policy, over our baseline PEBBLE. We validate our method on three domains: Mountain Car, the locomotion task Quadruped-Walk, and the robotic manipulation task Sweep-Into. We find that with the proposed augmentation the agent not only improves in overall performance but does so quite early in its training phase.
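To make the setting concrete, the following is a minimal sketch of how a reward model can be scored against one binary preference label, together with one possible form of an invariance-consistency term. The preference part uses the standard Bradley-Terry formulation common in preference-based reinforcement learning. The consistency penalty, the linear reward model, and the names `reward_fn`, `augment_fn`, and `consistency_penalty` are assumptions made for illustration and should not be read as the exact augmentation or loss proposed in this work.

```python
import numpy as np

def segment_return(reward_fn, segment):
    """Sum of predicted per-step rewards over a segment of shape (T, state_dim)."""
    return float(np.sum(reward_fn(segment)))

def preference_prob(reward_fn, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(reward_fn, seg_a, seg_b, label):
    """Binary cross-entropy; label is 1.0 if the human preferred seg_a, else 0.0."""
    p = np.clip(preference_prob(reward_fn, seg_a, seg_b), 1e-8, 1.0 - 1e-8)
    return float(-(label * np.log(p) + (1.0 - label) * np.log(1.0 - p)))

def consistency_penalty(reward_fn, segment, augment_fn):
    """Hypothetical term: mean squared gap between rewards on original and augmented states."""
    gap = reward_fn(segment) - reward_fn(augment_fn(segment))
    return float(np.mean(gap ** 2))

# Toy setup (all values are illustrative assumptions).
w = np.array([1.0, 0.0])                   # linear reward that ignores the second feature
reward_fn = lambda states: states @ w
rng = np.random.default_rng(0)
# Hypothetical augmentation: perturb only the second, task-irrelevant feature.
augment_fn = lambda states: states + np.array([0.0, 1.0]) * rng.normal(size=(states.shape[0], 1))

seg_a = np.ones((5, 2))                    # segment the human prefers
seg_b = np.zeros((5, 2))

loss = preference_loss(reward_fn, seg_a, seg_b, label=1.0)
penalty = consistency_penalty(reward_fn, seg_a, augment_fn)
total = loss + 0.5 * penalty               # the weight 0.5 is an arbitrary choice
```

In this toy case the reward model already ignores the perturbed feature, so the penalty is zero and only the preference term contributes. A reward model that reacted to the perturbed feature would incur a positive penalty, which is the intuition behind asking the model to stay consistent under augmentation.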