Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we focus on a systematic study of causal confusion and reward misidentification when learning from preferences. In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification. We also identify a set of methods with which to interpret misidentified learned rewards. In general, we observe that optimizing misidentified rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion -- failure to consider even one of many factors can result in unexpected, undesirable behavior.