We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively query an expert at each round to compare two actions and receive noisy preference feedback. The learner's objective is two-fold: to minimize the regret associated with the executed actions while simultaneously minimizing the number of comparison queries made to the expert. In this paper, we assume that the learner has access to a function class that can represent the expert's preference model under appropriate link functions, and we provide an algorithm that leverages an online regression oracle with respect to this function class both for choosing its actions and for deciding when to query. For the contextual bandit setting, our algorithm achieves a regret bound that combines the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$ represents the number of interactions, $d$ represents the eluder dimension of the function class, and $\Delta$ represents the minimum preference of the optimal action over any suboptimal action under all contexts. Our algorithm does not require knowledge of $\Delta$, and the obtained regret bound is comparable to what can be achieved in the standard contextual bandit setting, where the learner observes reward signals at each round. Additionally, our algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We then extend our algorithm to the imitation learning setting, where the learning agent engages with an unknown environment in episodes of length $H$ each, and provide similar guarantees for regret and query complexity. Interestingly, our algorithm for imitation learning can even learn to outperform the underlying expert when that expert is suboptimal, highlighting a practical benefit of preference-based feedback in imitation learning.
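To make the two rates concrete, the following is a minimal sketch that evaluates the stated regret and query scalings for given values of $T$, $d$, and $\Delta$. It is purely illustrative: the function names are hypothetical, the hidden constants in $O(\cdot)$ are set to 1, and any lower-order factors not spelled out in the abstract are ignored. It is not the algorithm itself, only arithmetic on the bounds.

```python
import math


def regret_rate(T: int, d: float, gap: float) -> float:
    """Evaluate min(sqrt(T), d / gap), the stated regret scaling.

    Constants hidden by O(.) are assumed to be 1 (illustrative only).
    """
    return min(math.sqrt(T), d / gap)


def query_rate(T: int, d: float, gap: float) -> float:
    """Evaluate min(T, d^2 / gap^2), the stated query-complexity scaling."""
    return min(float(T), (d / gap) ** 2)


def crossover_horizon(d: float, gap: float) -> float:
    """Horizon T at which sqrt(T) = d / gap, i.e. T = d^2 / gap^2.

    Below this horizon the sqrt(T) term is the smaller one; above it,
    the gap-dependent term d / gap takes over.
    """
    return (d / gap) ** 2


if __name__ == "__main__":
    d, gap = 5.0, 0.5  # hypothetical eluder dimension and preference gap
    for T in (50, 10_000):
        print(T, regret_rate(T, d, gap), query_rate(T, d, gap))
    print("crossover at T =", crossover_horizon(d, gap))
```

Both minima switch branches at the same horizon $T = d^2/\Delta^2$: for shorter horizons the bounds behave like $\sqrt{T}$ regret with up to $T$ queries, and for longer horizons they stop growing with $T$.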