Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Prior work has shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion, however, focuses on precisely identifying all parameters of the reward function. This can be wasteful, because many parameters may result in the same reward, and many rewards may result in the same behavior in downstream tasks. Instead, we show that it is possible to optimize for learning the reward function only up to a behavioral equivalence class, such as inducing the same ranking over behaviors, the same distribution over choices, or other related definitions of what makes two rewards similar. We introduce a tractable framework that can capture such definitions of similarity. Our experiments in a synthetic environment, an assistive robotics environment with domain transfer, and a natural language processing problem with real datasets demonstrate the superior performance of our querying method over the state-of-the-art information gain method.
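To make the notion of a behavioral equivalence class concrete, the following minimal sketch checks whether two reward parameter vectors induce the same ranking over a fixed set of behaviors. It is an illustration only and not the paper's method: the linear reward form, the feature vectors, and the names `reward`, `ranking`, and `same_ranking` are all assumptions introduced here.

```python
from typing import List, Sequence


def reward(w: Sequence[float], phi: Sequence[float]) -> float:
    """Assumed linear reward: dot product of parameters and behavior features."""
    return sum(wi * fi for wi, fi in zip(w, phi))


def ranking(w: Sequence[float], behaviors: List[Sequence[float]]) -> List[int]:
    """Indices of behaviors sorted from highest to lowest reward under w."""
    return sorted(range(len(behaviors)), key=lambda i: -reward(w, behaviors[i]))


def same_ranking(
    w_a: Sequence[float], w_b: Sequence[float], behaviors: List[Sequence[float]]
) -> bool:
    """True if w_a and w_b order the given behaviors identically."""
    return ranking(w_a, behaviors) == ranking(w_b, behaviors)


# Hypothetical feature vectors for three candidate behaviors.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

w_true = [0.8, 0.2]
w_scaled = [2.4, 0.6]  # different parameters (3 * w_true), same ordering
w_other = [0.2, 0.8]   # different parameters, reversed ordering

print(same_ranking(w_true, w_scaled, features))
print(same_ranking(w_true, w_other, features))
```

Under this assumed setup, `w_true` and `w_scaled` differ as parameter vectors but fall into the same ranking-based equivalence class, so queries spent distinguishing them would not change the induced behavior ordering.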