The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks depends heavily on the design of the underlying reward function. A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods mitigate this misalignment by learning reward functions from human preferences; however, they inadvertently introduce a risk of reward overoptimization. In this work, we address this challenge by advocating for regularized reward functions that more accurately mirror the intended behaviors. We propose a novel form of reward regularization within the robotic RLHF (RL from Human Feedback) framework, which we refer to as \emph{agent preferences}. Our approach incorporates not only human feedback in the form of preferences but also the preferences of the RL agent itself during the reward function learning process. This dual consideration significantly mitigates reward function overoptimization in RL. We provide a theoretical justification for the proposed approach by formulating the robotic RLHF problem as a bilevel optimization problem. We demonstrate the efficiency of our algorithm {\ours} on several continuous control benchmarks, including DeepMind Control Suite \cite{tassa2018deepmind} and MetaWorld \cite{yu2021metaworld}, as well as high-dimensional visual environments, improving sample efficiency by more than 70\% over current state-of-the-art (SOTA) baselines. These results indicate that our approach is effective at aligning reward functions with true behavioral intentions.
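
For orientation, the bilevel view mentioned above can be sketched in a generic form. The notation here is assumed for illustration and is not taken from the paper: $\nu$ denotes reward parameters, $\theta$ policy parameters, $R_\nu(\tau) = \sum_t r_\nu(s_t, a_t)$ the return of trajectory $\tau$ under the learned reward, $\mathcal{D}$ a dataset of labeled trajectory pairs $(\tau^0, \tau^1, y)$ with $y \in \{0, 1\}$ marking the preferred trajectory, and $\lambda \ge 0$ a regularization weight. The human-preference term uses the standard Bradley-Terry model; the $\lambda$-weighted term is only one plausible way to express an agent-preference regularizer and should not be read as the exact objective of {\ours}.
\[
\max_{\nu} \; \mathbb{E}_{(\tau^0, \tau^1, y) \sim \mathcal{D}} \Big[ y \log P_\nu(\tau^1 \succ \tau^0) + (1 - y) \log P_\nu(\tau^0 \succ \tau^1) \Big] + \lambda \, \mathbb{E}_{\tau \sim \pi_{\theta^*(\nu)}} \big[ R_\nu(\tau) \big]
\quad \mathrm{s.t.} \quad
\theta^*(\nu) \in \arg\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_\theta} \big[ R_\nu(\tau) \big],
\]
\[
P_\nu(\tau^1 \succ \tau^0) = \frac{\exp R_\nu(\tau^1)}{\exp R_\nu(\tau^0) + \exp R_\nu(\tau^1)} .
\]
In this sketch the upper level fits the reward to human preferences, the lower level trains the policy on that reward, and the coupling through $\theta^*(\nu)$ is what makes the reward update aware of the agent it induces.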