For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
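As an illustration of how preferences between pairs of trajectory segments could stand in for a reward function, the following minimal sketch fits a reward estimate to pairwise comparisons. It rests on assumptions the abstract does not state: a linear reward over made-up per-step features, a logistic (Bradley-Terry style) preference model, and a simulated labeler with a hidden reward in place of a human. All function names are hypothetical.

```python
import math
import random


def segment_return(weights, segment):
    """Sum of predicted per-step rewards over one trajectory segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)


def preference_probability(weights, seg_a, seg_b):
    """Modelled probability that segment A is preferred over segment B."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    diff = max(-30.0, min(30.0, diff))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-diff))


def fit_reward_model(comparisons, n_features, lr=0.05, epochs=200):
    """Stochastic gradient ascent on the log-likelihood of the preferences."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for seg_a, seg_b, a_preferred in comparisons:
            p = preference_probability(weights, seg_a, seg_b)
            error = (1.0 if a_preferred else 0.0) - p
            for i in range(n_features):
                feat_diff = sum(s[i] for s in seg_a) - sum(s[i] for s in seg_b)
                weights[i] += lr * error * feat_diff
    return weights


def random_segment(rng, length=5, n_features=2):
    """A short segment of random per-step feature vectors (assumed features)."""
    return [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(length)]


def make_comparisons(rng, hidden_weights, n_pairs=200):
    """Simulated labeler: prefers the segment with higher hidden return."""
    data = []
    for _ in range(n_pairs):
        a, b = random_segment(rng), random_segment(rng)
        label = segment_return(hidden_weights, a) > segment_return(hidden_weights, b)
        data.append((a, b, label))
    return data


rng = random.Random(0)
hidden = [1.0, -0.5]  # hidden reward, used only to generate labels
train = make_comparisons(rng, hidden)
learned = fit_reward_model(train, n_features=2)

# Check how often the learned model agrees with the labeler on fresh pairs.
held_out = make_comparisons(rng, hidden)
agreement = sum(
    (preference_probability(learned, a, b) > 0.5) == label
    for a, b, label in held_out
) / len(held_out)
print(f"agreement with simulated labeler: {agreement:.2f}")
```

The learner only ever sees which segment of each pair was preferred, never the hidden reward itself. In a full system, an RL agent would then be trained against the learned reward estimate; that step is omitted here.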