STARC: A General Framework For Quantifying Differences Between Reward Functions

To solve a task using reinforcement learning, it is first necessary to formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. The theoretical foundations of reward learning are not yet well-developed, however. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms must generally be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. We also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically to demonstrate their practical efficacy. STARC metrics can be used to make both the theoretical and the empirical analysis of reward learning algorithms easier and more principled.
