Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by using human preferences as the reward signal. However, current PbRL methods rely heavily on high-quality feedback from domain experts, which leaves them lacking in robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method uses a sample selection-based discriminator to dynamically filter out noisy preferences and keep training robust. To counteract the cumulative error stemming from incorrect selection, we propose a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.
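To make the idea of sample selection concrete, the following is a minimal sketch of loss-based filtering of preference labels. It is not the RIME implementation: the abstract does not state the selection criterion, so the sketch assumes a Bradley-Terry preference predictor over summed segment rewards and a fixed cross-entropy cutoff. The names `preference_prob`, `select_trustworthy`, and `threshold` are hypothetical and chosen only for illustration.

```python
import numpy as np


def preference_prob(r0_sum, r1_sum):
    """Bradley-Terry probability that segment 1 is preferred over segment 0.

    r0_sum, r1_sum: predicted rewards summed over each segment (assumed inputs).
    """
    return 1.0 / (1.0 + np.exp(r0_sum - r1_sum))


def select_trustworthy(r0_sum, r1_sum, labels, threshold):
    """Keep pairs whose label agrees well enough with the reward model.

    labels: 0 if segment 0 was marked as preferred, 1 if segment 1 was.
    threshold: hypothetical fixed cutoff on the per-sample cross-entropy;
               a real method may adapt this bound during training.
    Returns a boolean mask of kept pairs and the per-sample losses.
    """
    p1 = preference_prob(r0_sum, r1_sum)
    # Probability the model assigns to the label that was actually given.
    p_label = np.where(labels == 1, p1, 1.0 - p1)
    loss = -np.log(np.clip(p_label, 1e-8, 1.0))
    return loss < threshold, loss


# Toy data: the last label strongly contradicts the predicted rewards.
r0 = np.array([1.0, 0.0, 2.0, -1.0])
r1 = np.array([0.0, 3.0, 2.5, 4.0])
labels = np.array([0, 1, 1, 0])

mask, loss = select_trustworthy(r0, r1, labels, threshold=1.0)
print(mask)  # the fourth pair is filtered out as likely noise
```

Filtering of this kind depends on the reward model already being roughly right: if its early predictions are poor, correct labels get discarded and the error compounds, which is the cumulative-error problem that the warm start of the reward model is meant to address.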