There is increasing interest in learning reward functions that model human intent and human preferences. However, many frameworks use black-box learning methods that, while expressive, are difficult to interpret. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs) for both low- and high-dimensional state inputs. We explore the viability of learning interpretable reward functions with DDTs by evaluating our algorithm on Cartpole, Visual Gridworld environments, and Atari games. We provide evidence that the tree structure of our learned reward function is useful in determining the extent to which a reward function is aligned with human preferences. We visualize the learned reward DDTs and find that they are capable of learning interpretable reward functions, but that the discrete nature of the trees hurts the performance of reinforcement learning at test time. However, we also show evidence that using soft outputs (averaged over all leaf nodes) results in competitive performance when compared with larger-capacity deep neural network reward functions.
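To make the distinction between discrete and soft tree outputs concrete, the following is a minimal sketch of a depth-2 differentiable decision tree used as a reward function. It is an illustration under stated assumptions, not the implementation evaluated in this work: the class `TinyDDT`, the sigmoid routing at each internal node, the reading of "averaged over all leaf nodes" as a routing-probability-weighted average, and the Bradley-Terry style `preference_loss` are all hypothetical choices made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TinyDDT:
    """Hypothetical depth-2 differentiable decision tree: 3 internal nodes, 4 leaves."""

    def __init__(self, weights, biases, leaf_rewards):
        self.weights = np.asarray(weights, dtype=float)            # shape (3, state_dim)
        self.biases = np.asarray(biases, dtype=float)              # shape (3,)
        self.leaf_rewards = np.asarray(leaf_rewards, dtype=float)  # shape (4,)

    def _route_right(self, state):
        # p[i] is the probability of taking the right branch at internal node i.
        # Node 0 is the root, node 1 its left child, node 2 its right child.
        return sigmoid(self.weights @ np.asarray(state, dtype=float) + self.biases)

    def leaf_probs(self, state):
        # Probability of reaching each leaf, as a product of branch probabilities.
        p = self._route_right(state)
        return np.array([
            (1 - p[0]) * (1 - p[1]),
            (1 - p[0]) * p[1],
            p[0] * (1 - p[2]),
            p[0] * p[2],
        ])

    def soft_reward(self, state):
        # Assumed "soft output": leaf rewards weighted by routing probabilities.
        return float(self.leaf_probs(state) @ self.leaf_rewards)

    def hard_reward(self, state):
        # Discrete output: follow the more likely branch at each node, return one leaf.
        p = self._route_right(state)
        if p[0] >= 0.5:
            leaf = 3 if p[2] >= 0.5 else 2
        else:
            leaf = 1 if p[1] >= 0.5 else 0
        return float(self.leaf_rewards[leaf])


def preference_loss(tree, traj_a, traj_b):
    """Assumed Bradley-Terry cross-entropy for the label 'traj_b is preferred'."""
    ret_a = sum(tree.soft_reward(s) for s in traj_a)
    ret_b = sum(tree.soft_reward(s) for s in traj_b)
    # Equals -log(exp(ret_b) / (exp(ret_a) + exp(ret_b))), computed stably.
    return float(np.logaddexp(ret_a, ret_b) - ret_b)
```

In this sketch, a tree whose routing weights and biases are all zero sends every state to each leaf with equal probability, so `soft_reward` reduces to the plain mean of the leaf values while `hard_reward` still returns a single leaf value. The soft output varies smoothly with the state, whereas the hard output changes in jumps whenever a routing probability crosses 0.5.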