Preference-based Reinforcement Learning (PbRL) avoids the need for reward engineering by using human preferences as the reward signal. However, current PbRL algorithms rely too heavily on high-quality feedback from domain experts and therefore lack robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method incorporates a sample selection-based discriminator that dynamically filters denoised preferences for robust training. To mitigate the accumulated error caused by incorrect selection, we propose to warm start the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the current state-of-the-art PbRL method. Ablation studies further demonstrate that the warm start is crucial for both robustness and feedback efficiency in limited-feedback cases.
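To make the idea of sample selection concrete, the following is a minimal sketch of loss-based filtering of preference labels. It is an illustration under stated assumptions, not RIME's actual discriminator: the abstract does not specify the selection criterion, so the Bradley-Terry preference model, the cross-entropy loss, the fixed `threshold`, and all function names here are hypothetical choices made for the example.

```python
import math

def preference_prob(return_a, return_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    given the summed predicted rewards of each segment (assumed model)."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))

def cross_entropy(prob_a, label):
    """Loss of a predicted P(A preferred) against a 0/1 label (1 means A preferred)."""
    eps = 1e-12
    p = min(max(prob_a, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def select_trusted(samples, threshold):
    """Keep samples whose loss under the current reward estimates is below threshold.

    samples: list of (return_a, return_b, label) tuples.
    The fixed threshold is an assumption made for this sketch.
    """
    trusted = []
    for return_a, return_b, label in samples:
        loss = cross_entropy(preference_prob(return_a, return_b), label)
        if loss < threshold:
            trusted.append((return_a, return_b, label))
    return trusted

samples = [
    (2.0, 0.0, 1),  # label agrees with the reward estimates: low loss
    (2.0, 0.0, 0),  # label contradicts them: high loss, likely noisy
    (0.5, 0.4, 0),  # near tie: moderate loss whichever label is given
]
trusted = select_trusted(samples, threshold=1.0)
```

In this sketch only the contradicting sample is filtered out. A selection rule of this kind depends on the quality of the reward estimates used to compute the loss, which is consistent with the abstract's point that incorrect selection can accumulate error and that warm starting the reward model helps mitigate it.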