Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an AI system. This human feedback is often a preference comparison, in which the human teacher compares several samples of AI behavior and chooses which they believe best accomplishes the objective. While reward learning typically assumes that all feedback comes from a single teacher, in practice these systems often query multiple teachers to gather sufficient training data. In this paper, we investigate this disparity, and find that algorithmic evaluation of these different sources of feedback facilitates more accurate and efficient reward learning. We formally analyze the value of information (VOI) when reward learning from teachers with varying levels of rationality, and define and evaluate an algorithm that utilizes this VOI to actively select teachers to query for feedback. Surprisingly, we find that it is often more informative to query comparatively irrational teachers. By formalizing this problem and deriving an analytical solution, we hope to facilitate improvement in reward learning approaches to aligning AI behavior with human values.
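To make the teacher-selection idea concrete, the following is a minimal sketch, not the paper's algorithm. It rests on several assumptions that the abstract does not state: each teacher is modeled as Boltzmann-rational with a rationality parameter `beta`, so that the teacher prefers sample A over sample B with probability sigmoid(`beta` × reward gap); the learner holds a discrete prior over candidate reward gaps; and the expected information gain (mutual information between the teacher's answer and the reward hypothesis) stands in for VOI. All function names are hypothetical.

```python
import math

def choice_prob(beta, delta):
    """P(teacher prefers A over B) under an assumed Boltzmann-rational model.

    beta  : teacher rationality (higher means less noisy choices)
    delta : reward(A) - reward(B) under one reward hypothesis
    """
    return 1.0 / (1.0 + math.exp(-beta * delta))

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli(p) answer, clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def expected_info_gain(beta, deltas, prior):
    """Mutual information between the teacher's answer and the hypothesis.

    deltas : reward gap implied by each candidate reward hypothesis
    prior  : learner's prior probability of each hypothesis (sums to 1)
    """
    probs = [choice_prob(beta, d) for d in deltas]
    marginal = sum(w * p for w, p in zip(prior, probs))
    conditional = sum(w * binary_entropy(p) for w, p in zip(prior, probs))
    return binary_entropy(marginal) - conditional

def select_teacher(betas, deltas, prior):
    """Return the index of the teacher with the largest expected gain."""
    gains = [expected_info_gain(b, deltas, prior) for b in betas]
    best = max(range(len(betas)), key=gains.__getitem__)
    return best, gains

# Toy query: both hypotheses agree that A is better, but disagree on by how much.
deltas = [0.5, 2.0]
prior = [0.5, 0.5]
betas = [0.5, 10.0]  # a noisy teacher and a nearly deterministic one

best, gains = select_teacher(betas, deltas, prior)
print("selected teacher:", best, "gains:", gains)
```

In this toy setup the noisier teacher (`beta = 0.5`) is selected. The nearly deterministic teacher picks A under either hypothesis, so its answer reveals almost nothing, whereas the noisier teacher's choice frequency still depends on the size of the reward gap. This is only an illustration of how querying a less rational teacher can be more informative; the paper's own VOI analysis may be defined differently.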