To solve a task using reinforcement learning, it is first necessary to formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper, we provide a solution to this problem in the form of a class of pseudometrics on the space of all reward functions, which we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we identify a number of issues with reward metrics proposed in earlier work. Finally, we evaluate our metrics empirically to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
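To make the construction concrete, the following is a minimal sketch of a STARC-style pseudometric for a finite MDP with known transition probabilities. It follows the canonicalise-normalise-compare recipe that the name suggests: first strip out potential shaping using the value function of a fixed (here, uniformly random) policy, then rescale to unit norm so that positive rescalings of a reward are identified, and finally take the L2 distance between the standardised rewards. The function names and the particular choices of canonicalisation and norm are illustrative assumptions, not the exact constructions analysed in the paper.

```python
import numpy as np


def uniform_policy_values(R, T, gamma):
    """Exact value function V of the uniformly random policy.

    R, T: (S, A, S) arrays giving rewards R[s, a, s'] and transition
    probabilities T[s, a, s']; gamma: discount factor in [0, 1).
    """
    S, A, _ = R.shape
    r_pi = np.einsum('sat,sat->s', T, R) / A   # expected one-step reward per state
    P_pi = T.mean(axis=1)                      # (S, S) state-to-state transition matrix
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)


def canonicalise(R, T, gamma):
    """Remove potential shaping: C(R)(s, a, s') = R(s, a, s') + gamma*V(s') - V(s).

    Any two rewards that differ only by potential shaping map to the same
    canonical form, since shaping with a potential phi shifts V by exactly -phi.
    """
    V = uniform_policy_values(R, T, gamma)
    return R + gamma * V[None, None, :] - V[:, None, None]


def starc_distance(R1, R2, T, gamma, eps=1e-12):
    """A STARC-style pseudometric: canonicalise, normalise, then L2 distance."""
    def standardise(R):
        C = canonicalise(R, T, gamma)
        n = np.linalg.norm(C)                  # 2-norm of the flattened array
        return C / n if n > eps else C         # trivial rewards standardise to 0
    return np.linalg.norm(standardise(R1) - standardise(R2))  # value in [0, 2]
```

As a sanity check, a reward and any potential-shaped or positively rescaled version of it should standardise to the same point, so their distance should be numerically zero:

```python
# Sanity check: potential shaping and positive rescaling do not change the metric.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
T = rng.random((S, A, S))
T /= T.sum(axis=2, keepdims=True)              # rows are probability distributions
R1 = rng.normal(size=(S, A, S))
phi = rng.normal(size=S)                       # arbitrary potential function
R2 = R1 + gamma * phi[None, None, :] - phi[:, None, None]
print(starc_distance(R1, R2, T, gamma))        # ~0: the rewards are equivalent
print(starc_distance(R1, 3.0 * R1, T, gamma))  # ~0: positive rescaling, too
```

Dividing by the norm is what makes the pseudometric invariant to positive rescaling, and mapping the all-zero canonical form to itself keeps trivial rewards well defined; the regret bounds discussed above are properties of the metrics constructed in the paper, not of this simplified sketch.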