Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically principled study of reinforcement learning under reward misspecification.
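As a toy illustration of the Goodhart effect described above, the following sketch traces how true reward can first rise and then fall as optimisation pressure on a proxy increases. It is not the paper's experimental setup, its measure of the effect, or its early stopping method. The three-armed bandit, the reward values, and the use of a Boltzmann policy's inverse temperature `beta` as the measure of optimisation pressure are all assumptions chosen for illustration.

```python
import math

# Hypothetical three-armed bandit (all values are illustrative assumptions).
# The proxy slightly prefers arm 0, while the true objective prefers arm 1.
PROXY_REWARD = [1.0, 0.9, 0.0]
TRUE_REWARD = [0.0, 1.0, -1.0]

# Optimisation pressure: inverse temperatures 0.0, 0.5, ..., 100.0.
BETAS = [0.5 * k for k in range(201)]


def boltzmann_policy(rewards, beta):
    """Return action probabilities proportional to exp(beta * reward)."""
    top = max(rewards)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(policy, rewards):
    """Expected one-step reward of a stochastic policy."""
    return sum(p * r for p, r in zip(policy, rewards))


def goodhart_curve(proxy=PROXY_REWARD, true=TRUE_REWARD, betas=BETAS):
    """Return (beta, proxy value, true value) for each level of pressure."""
    rows = []
    for beta in betas:
        policy = boltzmann_policy(proxy, beta)
        rows.append((beta,
                     expected_reward(policy, proxy),
                     expected_reward(policy, true)))
    return rows


curve = goodhart_curve()

# Locate the pressure at which true reward peaks. This uses the true
# reward directly, which is exactly what is unavailable in practice.
peak_beta, peak_proxy, peak_true = max(curve, key=lambda row: row[2])

if __name__ == "__main__":
    for beta, proxy_value, true_value in curve[::20]:
        print(f"beta={beta:6.1f}  proxy={proxy_value:.4f}  true={true_value:.4f}")
    print(f"true reward peaks at beta={peak_beta}")
```

In this toy, the proxy value never decreases as `beta` grows, while the true value climbs from zero, peaks at a moderate `beta`, and then decays back towards zero as the policy concentrates on the arm the proxy prefers. The peak here is found by consulting the true reward, so the sketch only shows why stopping early can help; it says nothing about how to choose the stopping point when the true reward is unknown.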