To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be difficult and time-consuming. Human-in-the-loop (HitL) RL instead allows agents to learn reward functions from human feedback. Despite recent successes, many HitL RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of HitL RL methods (i.e., to require less feedback), this paper introduces Sub-optimal Data Pre-training (SDP), an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based HitL RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with rewards of zero. This yields reward labels at no feedback cost, which we use to pre-train the reward model. The pre-training phase gives the reward model a head start: it learns that low-quality transitions should receive a low reward, all without any actual human feedback. Through extensive experiments with a simulated teacher, we demonstrate that SDP significantly improves upon, or performs competitively with, state-of-the-art (SOTA) HitL RL algorithms across nine robotic manipulation and locomotion tasks.
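The pseudo-labeling and pre-training steps described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: a linear reward model, a mean-squared-error loss, plain gradient descent, and randomly generated features standing in for sub-optimal (state, action) transitions. The function names `pseudo_label_zero` and `pretrain_reward_model` are hypothetical and chosen only for illustration.

```python
import numpy as np


def pseudo_label_zero(transitions):
    """Pair every sub-optimal transition with a pseudo-label reward of zero."""
    return transitions, np.zeros(len(transitions))


def pretrain_reward_model(features, labels, lr=0.5, epochs=500, seed=0):
    """Fit an assumed linear reward model r(x) = w . x + b by gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=features.shape[1])  # arbitrary non-zero start
    b = 0.5
    for _ in range(epochs):
        err = features @ w + b - labels          # prediction error per transition
        w -= lr * features.T @ err / len(labels)  # gradient step on the weights
        b -= lr * err.mean()                      # gradient step on the bias
    return w, b


# Synthetic stand-in for reward-free, sub-optimal (state, action) features.
data_rng = np.random.default_rng(1)
suboptimal = data_rng.uniform(-1.0, 1.0, size=(256, 6))

x, y = pseudo_label_zero(suboptimal)
w, b = pretrain_reward_model(x, y)
pred = x @ w + b  # predicted rewards on the sub-optimal data after pre-training
```

After pre-training, the model assigns rewards close to zero to the sub-optimal transitions without any human label having been requested. In a HitL RL pipeline, this pre-trained model would then presumably be refined with scalar or preference feedback; that stage is not shown here.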