Interactive reinforcement learning has shown promise in learning complex robotic tasks. However, the process can be human-intensive because it requires a large amount of interactive feedback. This paper presents a new method that uses scores provided by humans, instead of pairwise preferences, to improve the feedback efficiency of interactive reinforcement learning. Our key insight is that scores can yield significantly more data than pairwise preferences. Specifically, we require a teacher to interactively score the full trajectories of an agent in order to train a behavioral policy in a sparse-reward environment. To prevent unstable human scores from negatively impacting the training process, we propose an adaptive learning scheme, which makes the learning paradigm insensitive to imperfect or unreliable scores. We extensively evaluate our method on robotic locomotion and manipulation tasks. The results show that the proposed method can efficiently learn near-optimal policies by adaptive learning from scores while requiring less feedback than pairwise preference learning methods. The source code is publicly available at https://github.com/SSKKai/Interactive-Scoring-IRL.
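One way to read the claim that scores yield more data than pairwise preferences is combinatorial: if every trajectory carries its own score, any two scored trajectories can be compared, so n scores imply up to n(n-1)/2 preference pairs. The sketch below illustrates that counting argument only. It is an assumption about how scores could be expanded into pairs, not a description of the released implementation, and the function name `pairs_from_scores` and the `tie_margin` parameter are hypothetical.

```python
from itertools import combinations


def pairs_from_scores(scores, tie_margin=0.0):
    """Derive pairwise preference labels from per-trajectory scores.

    Hypothetical helper for illustration. Returns a list of (i, j, label)
    tuples, where label is 1.0 if trajectory i is scored higher than
    trajectory j, 0.0 if it is scored lower, and 0.5 if the two scores
    differ by no more than tie_margin.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        diff = scores[i] - scores[j]
        if abs(diff) <= tie_margin:
            label = 0.5  # treat near-equal scores as a tie
        elif diff > 0:
            label = 1.0  # trajectory i preferred
        else:
            label = 0.0  # trajectory j preferred
        pairs.append((i, j, label))
    return pairs


# Four scored trajectories imply 4 * 3 / 2 = 6 comparable pairs.
scores = [2.0, 7.5, 5.0, 9.0]
pairs = pairs_from_scores(scores)
print(len(pairs))
```

Under this reading, four scoring queries produce six labeled comparisons, whereas four pairwise queries would produce four. A nonzero `tie_margin` is one simple way to avoid treating small, possibly noisy score differences as strict preferences.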