The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open-sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
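To make the contrast between the two preference models concrete, the following is a minimal sketch and not the authors' implementation. It assumes a logistic link between segment statistics and preference probabilities, deterministic transitions, and a simplified regret of the form V*(start) - (sum of rewards + V*(end)); the abstract specifies none of these details. All function names and numeric values are hypothetical.

```python
import math


def partial_return(rewards):
    """Sum of rewards along a segment (the statistic named in the abstract)."""
    return sum(rewards)


def segment_regret(rewards, v_start, v_end):
    """Assumed regret form: optimal value at the segment's start minus
    (rewards collected + optimal value at the segment's end).
    Zero means the segment lost no value relative to optimal behavior."""
    return v_start - (sum(rewards) + v_end)


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def pref_by_partial_return(seg_1, seg_2):
    """P(seg_1 preferred) when only partial return matters (assumed logistic link)."""
    return logistic(partial_return(seg_1["rewards"]) - partial_return(seg_2["rewards"]))


def pref_by_regret(seg_1, seg_2):
    """P(seg_1 preferred) when lower regret is preferred (assumed logistic link)."""
    r1 = segment_regret(seg_1["rewards"], seg_1["v_start"], seg_1["v_end"])
    r2 = segment_regret(seg_2["rewards"], seg_2["v_start"], seg_2["v_end"])
    return logistic(r2 - r1)


# Hypothetical segments with a reward of -1 per step. Both start in a state
# whose optimal value is -5. Segment A moves toward the goal (end value -3);
# segment B moves away from it (end value -7).
seg_a = {"rewards": [-1, -1], "v_start": -5.0, "v_end": -3.0}
seg_b = {"rewards": [-1, -1], "v_start": -5.0, "v_end": -7.0}

p_partial = pref_by_partial_return(seg_a, seg_b)
p_regret = pref_by_regret(seg_a, seg_b)

print("partial return model:", p_partial)
print("regret model:", p_regret)
```

In this toy case both segments have the same partial return, so the partial return model is indifferent between them, while the regret model strongly prefers the segment that ends in a better state. This is one illustration of how the two models can disagree; it is not a reproduction of the paper's proofs or experiments.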