Robots can learn to imitate humans by inferring what the human is optimizing for. One common framework for this is Bayesian reward learning, where the robot treats the human's demonstrations and corrections as observations of the human's underlying reward function. Unfortunately, this inference is doubly intractable: to compute the normalizer, the robot must reason over all the trajectories the person could have provided and all the rewards the person could have in mind. Prior work uses existing robotic tools to approximate this normalizer. In this paper, we group previous approaches into three fundamental classes and analyze the theoretical pros and cons of each class. We then leverage recent research from the statistics community to introduce Double MH reward learning, a Monte Carlo method for asymptotically learning the human's reward in continuous spaces. We extend Double MH to conditionally independent settings (where each human correction is viewed as completely separate) and conditionally dependent settings (where the human's current correction may build on previous inputs). Across simulations and user studies, our proposed approach infers the human's reward parameters more accurately than the alternative approximations when learning from either demonstrations or corrections. See videos here: https://youtu.be/EkmT3o5K5ko
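
To make the idea of sampling a reward posterior without ever computing the normalizer concrete, here is a minimal sketch of a generic double Metropolis-Hastings sampler (reading "Double MH" as double Metropolis-Hastings). It is not the paper's implementation. Everything specific in it is an assumption made for illustration: a one-dimensional "trajectory" xi in [-1, 1], a scalar reward parameter theta with a uniform prior on [-1, 1], a quadratic reward, an exponential (Boltzmann-style) human model, and the hypothetical names `log_f`, `log_prior`, `inner_mh`, and `double_mh`. Demonstrations are treated as conditionally independent.

```python
import numpy as np

def log_f(xi, theta, beta=5.0):
    """Unnormalized log-likelihood of trajectory xi given reward parameter theta.
    Assumed toy model: reward -(xi - theta)^2, trajectories restricted to [-1, 1]."""
    if abs(xi) > 1.0:
        return -np.inf
    return -beta * (xi - theta) ** 2

def log_prior(theta):
    """Assumed uniform prior over theta on [-1, 1]."""
    return 0.0 if abs(theta) <= 1.0 else -np.inf

def inner_mh(xi0, theta, n_steps, rng, step=0.3):
    """Inner chain: a few MH steps over trajectories targeting f(. | theta),
    started at the observed trajectory, to draw an auxiliary trajectory."""
    xi = xi0
    for _ in range(n_steps):
        prop = xi + step * rng.normal()
        if np.log(rng.uniform()) < log_f(prop, theta) - log_f(xi, theta):
            xi = prop
    return xi

def double_mh(demos, n_samples, rng, inner_steps=20, step=0.2):
    """Outer chain over theta. The auxiliary trajectories make the unknown
    normalizers cancel in the acceptance ratio, so they are never evaluated."""
    theta = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = theta + step * rng.normal()  # symmetric random-walk proposal
        if np.isfinite(log_prior(prop)):
            log_ratio = log_prior(prop) - log_prior(theta)
            for xi in demos:
                aux = inner_mh(xi, prop, inner_steps, rng)
                # observed data favors prop; auxiliary data corrects for Z(theta)
                log_ratio += log_f(xi, prop) - log_f(xi, theta)
                log_ratio += log_f(aux, theta) - log_f(aux, prop)
            if np.log(rng.uniform()) < log_ratio:
                theta = prop
        samples[i] = theta
    return samples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = [0.4, 0.5, 0.6, 0.55, 0.45]  # made-up demonstrations
    samples = double_mh(demos, 1000, rng)
    print("posterior mean of theta (after burn-in):", samples[500:].mean())
```

The inner chain is only run for a finite number of steps, so the auxiliary trajectory is an approximate draw; more inner steps trade computation for accuracy. Handling corrections that build on earlier inputs would need a likelihood that conditions on the history, which this sketch does not attempt.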