Specifying reward functions by hand for complex tasks such as object manipulation or driving is challenging. Reward learning seeks to address this by learning a reward model from human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem. Our framework models rewards and policies as nonparametric functions belonging to subsets of reproducing kernel Hilbert spaces (RKHSs). The learner receives noisy oracle access to a true reward and must output a policy that performs well under that reward. For this setting, we first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression. We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set, and obtain a finite-sample statistical rate that depends primarily on the eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the generality of these results, our bounds are stronger than previous bounds developed for more specialized problems. In particular, we show that the well-studied problem of Gaussian process (GP) bandit optimization is a special case of our framework, and that our bounds either improve on or are competitive with known regret guarantees for the Matérn kernel.
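To make the plug-in idea concrete, the following is a minimal sketch in the simplified bandit-style special case, not the estimator analyzed in the paper. It fits a kernel ridge regression to noisy reward queries and then selects the candidate that maximizes the fitted reward. Everything specific here is an assumption made for illustration: the one-dimensional action domain, the hypothetical `true_reward`, the Matérn-5/2 kernel with length scale 0.2, the uniformly spaced query set (which is not an optimized design), and the regularization value.

```python
import numpy as np

def matern52(x, y, length_scale=0.2):
    """Matern-5/2 kernel matrix between 1-D point sets x (n,) and y (m,)."""
    r = np.abs(x[:, None] - y[None, :]) / length_scale
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def fit_krr(x_query, y_noisy, lam=1e-3):
    """Kernel ridge regression: solve (K + n * lam * I) alpha = y."""
    n = len(x_query)
    K = matern52(x_query, x_query)
    return np.linalg.solve(K + n * lam * np.eye(n), y_noisy)

def predict(alpha, x_query, x_new):
    """Evaluate the fitted reward estimate at new points."""
    return matern52(x_new, x_query) @ alpha

def plug_in_choice(alpha, x_query, candidates):
    """Plug-in step: return the candidate with the highest estimated reward."""
    return candidates[np.argmax(predict(alpha, x_query, candidates))]

def true_reward(x):
    """Hypothetical ground-truth reward, used only to simulate the oracle."""
    return np.sin(3.0 * x)

rng = np.random.default_rng(0)

# Assumed query set: 40 uniformly spaced points (not an optimized design).
x_query = np.linspace(0.0, 1.0, 40)
y_noisy = true_reward(x_query) + 0.05 * rng.standard_normal(x_query.shape)

alpha = fit_krr(x_query, y_noisy)
candidates = np.linspace(0.0, 1.0, 201)
x_hat = plug_in_choice(alpha, x_query, candidates)

# Excess risk of the plug-in choice relative to the best candidate.
excess_risk = true_reward(candidates).max() - true_reward(x_hat)
print(f"chosen action: {x_hat:.3f}, excess risk: {excess_risk:.4f}")
```

In this toy setting the query design question amounts to choosing `x_query`; replacing the uniform grid with another query set and comparing `excess_risk` gives a rough feel for why the choice of queries matters.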