Exploration and reward specification are fundamental, intertwined challenges in reinforcement learning. Sequential decision-making tasks that demand extensive exploration require either carefully designed reward functions or novelty-seeking exploration bonuses. Human supervisors can provide effective in-the-loop guidance to direct exploration, but prior methods for leveraging this guidance require constant, synchronous, high-quality human feedback, which is expensive and impractical to obtain. In this work, we present Human Guided Exploration (HuGE), a technique that uses low-quality feedback from non-expert users, which may be sporadic, asynchronous, and noisy. HuGE guides exploration for reinforcement learning both in simulation and in the real world, without meticulous reward specification. The key idea is to separate human feedback from policy learning: human feedback steers exploration, while self-supervised learning on the exploration data yields unbiased policies. This procedure can leverage noisy, asynchronous human feedback to learn policies without hand-crafted reward design or exploration bonuses. HuGE learns a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm scales to learning directly on real-world robots, using occasional, asynchronous feedback from human supervisors.
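The abstract does not say how the feedback is collected or which self-supervised objective is used, so the following is only a toy sketch of the separation it describes, under several assumptions of ours: feedback arrives as occasional pairwise comparisons ("which state looks closer to the goal?"), the goal selector is a simple win count, the policy is a tabular goal-conditioned model trained by hindsight relabeling, and the environment is a 1-D chain. Every name here (`noisy_human_prefers`, `relabel`, `act`, `train`) is hypothetical and not taken from HuGE.

```python
import random
from collections import defaultdict

# Toy 1-D chain; all constants are illustrative assumptions.
N_STATES, GOAL, ACTIONS = 10, 9, (-1, +1)

def step(state, action):
    """Move left or right along the chain, clipping at both ends."""
    return min(max(state + action, 0), N_STATES - 1)

def noisy_human_prefers(a, b, rng, noise=0.2):
    """Simulated annotator: is state a closer to the goal than b?
    The answer is flipped with probability `noise`."""
    truth = abs(GOAL - a) < abs(GOAL - b)
    return truth if rng.random() >= noise else not truth

def relabel(states, actions):
    """Hindsight relabeling: each later state of a trajectory is treated
    as a goal that the earlier (state, action) pairs helped to reach."""
    return [(states[t], actions[t], states[k])
            for t in range(len(actions))
            for k in range(t + 1, len(states))]

def act(counts, state, goal, rng):
    """Goal-conditioned policy: most frequent relabeled action,
    or a random action when the counts are tied or empty."""
    c = counts[(state, goal)]
    if c[-1] == c[+1]:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: c[a])

def train(episodes=200, horizon=12, feedback_every=5, seed=0):
    rng = random.Random(seed)
    wins = defaultdict(int)                         # goal-selector scores
    counts = defaultdict(lambda: defaultdict(int))  # policy statistics
    visited = {0}
    for ep in range(episodes):
        # Sporadic feedback: at most one comparison every few episodes.
        if ep % feedback_every == 0 and len(visited) > 1:
            a, b = rng.sample(sorted(visited), 2)
            wins[a if noisy_human_prefers(a, b, rng) else b] += 1
        # Feedback only decides where exploration is directed.
        frontier = max(sorted(visited), key=lambda s: wins[s])
        state, states, actions = 0, [0], []
        exploring = False
        for _ in range(horizon):
            # Head for the frontier, then explore randomly from it.
            exploring = exploring or state == frontier
            a = rng.choice(ACTIONS) if exploring else act(counts, state, frontier, rng)
            state = step(state, a)
            states.append(state)
            actions.append(a)
            visited.add(state)
        # The policy update never reads the human labels.
        for s, a, g in relabel(states, actions):
            counts[(s, g)][a] += 1
    return counts, visited, wins
```

Calling `counts, visited, wins = train()` and then `act(counts, 0, GOAL, random.Random(1))` queries the learned policy. In this sketch a wrong comparison can only misdirect where the agent explores; the policy update itself uses relabeled trajectories and never the human labels, which is one way to read the abstract's claim that noisy feedback need not bias the learned policy.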