Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent and alleviating the challenges of hand-crafting reward values. Despite the increasing interest in RLHF, most works learn black-box reward functions that, while expressive, are difficult to interpret and often require running the entire costly RL process before we can decipher whether these frameworks are actually aligned with human preferences. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments, and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We also provide experimental evidence that reward DDTs can often achieve competitive RL performance compared with larger-capacity deep neural network reward functions, and that our framework has diagnostic utility for checking the alignment of learned reward functions. Finally, we observe that the choice between soft and hard (argmax) outputs of a reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance and wanting simpler, more interpretable rewards. Videos and code are available at: https://sites.google.com/view/ddt-rlhf
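To make the soft-versus-hard output distinction concrete, below is a minimal, hypothetical sketch of a reward DDT and a Bradley-Terry-style preference loss. The class and function names, the fixed-depth tree layout, and the per-state reward summation are illustrative assumptions, not the paper's exact architecture: each internal node routes a state vector left or right with a sigmoid gate, the soft reward is the probability-weighted average of leaf values, and the hard (argmax) reward reads off the single most likely leaf.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RewardDDT:
    """Minimal complete binary soft decision tree mapping a state to a scalar reward.

    Internal nodes are stored in breadth-first order; leaf k's reachability
    probability is the product of routing probabilities along its root-to-leaf path.
    """
    def __init__(self, depth, state_dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.depth = depth
        n_internal = 2 ** depth - 1
        n_leaves = 2 ** depth
        self.W = rng.normal(scale=0.1, size=(n_internal, state_dim))  # gating weights
        self.b = np.zeros(n_internal)                                 # gating biases
        self.leaf_rewards = rng.normal(size=n_leaves)                 # learnable leaf values

    def leaf_probs(self, x):
        # Expand level by level: split each current path probability into
        # left (1 - p_right) and right (p_right) children.
        probs = np.ones(1)
        idx = 0
        for level in range(self.depth):
            n_nodes = 2 ** level
            p_right = sigmoid(self.W[idx:idx + n_nodes] @ x + self.b[idx:idx + n_nodes])
            probs = np.stack([probs * (1 - p_right), probs * p_right], axis=-1).reshape(-1)
            idx += n_nodes
        return probs  # sums to 1 over leaves

    def reward(self, x, hard=False):
        p = self.leaf_probs(x)
        if hard:
            # argmax routing: reward of the single most likely leaf (more interpretable)
            return float(self.leaf_rewards[np.argmax(p)])
        # soft routing: expected leaf reward (smoother, more shaped signal)
        return float(p @ self.leaf_rewards)

def preference_loss(tree, traj_a, traj_b, prefer_a=True):
    """Standard Bradley-Terry preference likelihood over summed trajectory rewards:
    P(a > b) = exp(R_a) / (exp(R_a) + exp(R_b)). Returns the negative log-likelihood."""
    R_a = sum(tree.reward(s) for s in traj_a)
    R_b = sum(tree.reward(s) for s in traj_b)
    p_a = 1.0 / (1.0 + np.exp(R_b - R_a))
    return -np.log(p_a if prefer_a else 1.0 - p_a)
```

In this sketch, training would minimize `preference_loss` over a dataset of preference-labeled trajectory pairs via gradient descent on `W`, `b`, and `leaf_rewards`; the learned tree can then be inspected node by node, which is the interpretability benefit the abstract highlights.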