STARC: A General Framework For Quantifying Differences Between Reward Functions

In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use \emph{reward learning algorithms}, which attempt to \emph{learn} a reward function from data. However, the theoretical foundations of reward learning are not yet well developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper, we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. We also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
