In this work, we propose REBEL, an algorithm for sample-efficient, reward-regularization-based robotic reinforcement learning from human feedback (RRLHF). The performance of reinforcement learning (RL) on continuous-control robotics tasks is sensitive to the underlying reward function. In practice, the reward function often ends up misaligned with human intent, values, and social norms, leading to catastrophic failures in the real world. We leverage human preferences to learn regularized reward functions and thereby align agents with the intended behavior. We introduce a novel form of reward regularization, termed agent preferences, into the existing RRLHF framework: in addition to human feedback expressed as preferences, we also take into account the preference of the underlying RL agent while learning the reward function. We show that this helps mitigate the over-optimization associated with the design of reward functions in RL. Experimentally, REBEL achieves up to a 70% improvement in sample efficiency in reaching a similar level of episodic reward return compared with state-of-the-art methods such as PEBBLE and PEBBLE+SURF.
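To make the idea of a regularized, preference-based reward objective concrete, the following is a minimal sketch and not REBEL's actual formulation, which the abstract does not spell out. It assumes the Bradley-Terry preference model commonly used in preference-based RL (including PEBBLE) for the human-feedback term. The agent-preference regularizer is left as a caller-supplied scalar, `agent_term`, with a hypothetical weight `lam`; all function names are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Each argument is a list of per-step rewards predicted by the reward
    model for one trajectory segment (an assumed representation).
    """
    return sigmoid(sum(rewards_a) - sum(rewards_b))


def human_preference_loss(pairs):
    """Mean cross-entropy over (rewards_a, rewards_b, label) triples.

    label is 1.0 if the human preferred segment A and 0.0 if B.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for rewards_a, rewards_b, label in pairs:
        p = preference_prob(rewards_a, rewards_b)
        total -= label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps)
    return total / len(pairs)


def regularized_loss(pairs, agent_term, lam=0.1):
    """Human-preference loss plus a weighted regularizer.

    agent_term stands in for an agent-preference penalty; its form is an
    assumption here and must be computed by the caller.
    """
    return human_preference_loss(pairs) + lam * agent_term
```

With `lam=0` this reduces to the plain human-preference loss, so the weight controls how strongly the assumed agent-side term shapes the learned reward.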