Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we focus on a systematic study of causal confusion and reward misidentification when learning from preferences. In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification. We also identify a set of methods with which to interpret misidentified learned rewards. In general, we observe that optimizing misidentified rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion -- failure to consider even one of many factors can result in unexpected, undesirable behavior.
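To make the failure mode described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear reward over trajectory features and the Bradley-Terry preference model that is commonly used in preference-based reward learning. The function names (`reward`, `preference_loss`, `fit_linear_reward`), the two-feature toy data, and the hyperparameters are all hypothetical choices made for illustration. Feature 0 plays the role of the causal feature (the true reward), and feature 1 is a non-causal distractor that is perfectly correlated with it in the training preferences.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, phi):
    """Linear learned reward: dot product of weights and trajectory features."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def preference_loss(w, pairs):
    """Mean Bradley-Terry negative log-likelihood.

    Each pair is (features of preferred trajectory, features of rejected one).
    """
    total = 0.0
    for phi_pos, phi_neg in pairs:
        margin = reward(w, phi_pos) - reward(w, phi_neg)
        total += -math.log(sigmoid(margin))
    return total / len(pairs)


def fit_linear_reward(pairs, dim, lr=0.5, steps=200):
    """Plain gradient descent on the preference loss (hypothetical settings)."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_pos, phi_neg in pairs:
            margin = reward(w, phi_pos) - reward(w, phi_neg)
            coeff = sigmoid(margin) - 1.0  # derivative of -log(sigmoid(m)) w.r.t. m
            for i in range(dim):
                grad[i] += coeff * (phi_pos[i] - phi_neg[i]) / len(pairs)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Toy training preferences: the distractor (feature 1) always equals
# the causal feature (feature 0), so the data cannot tell them apart.
train_pairs = [
    ((1.0, 1.0), (0.0, 0.0)),
    ((0.8, 0.8), (0.2, 0.2)),
]
w = fit_linear_reward(train_pairs, dim=2)

# Out-of-distribution comparison: the distractor is decorrelated.
# The true reward (feature 0 only) prefers `good` over `bad`.
good = (1.0, 0.0)
bad = (0.5, 1.0)

print("learned weights:", w)
print("training loss:", preference_loss(w, train_pairs))
print("learned reward prefers the truly worse trajectory:",
      reward(w, bad) > reward(w, good))
```

Because the two features are indistinguishable in the training pairs, gradient descent splits the weight evenly between them. The learned reward therefore fits the training preferences well, yet on the decorrelated pair it ranks the trajectory with the lower true reward higher, which is the kind of misidentification a policy optimizer could exploit.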