Selective Sampling and Imitation Learning via Online Regression

We consider the problem of Imitation Learning (IL) by actively querying a noisy expert for feedback. While imitation learning has been empirically successful, much of the prior work assumes access to noiseless expert feedback, which is not practical in many applications. In fact, when one only has access to noisy expert feedback, algorithms that rely on purely offline data (non-interactive IL) can be shown to need a prohibitively large number of samples to be successful. In contrast, in this work, we provide an interactive algorithm for IL that uses selective sampling to actively query the noisy expert for feedback. Our contributions are twofold. First, we provide a new selective sampling algorithm that works with general function classes and multiple actions, and obtains the best-known bounds for the regret and the number of queries. Second, we extend this analysis to the problem of IL with noisy expert feedback and provide a new IL algorithm that makes limited queries. Our algorithm for selective sampling leverages function approximation and relies on an online regression oracle with respect to the given model class to predict actions and to decide whether to query the expert for its label. On the theoretical side, the regret of our algorithm is upper bounded by the regret of the online regression oracle, while the query complexity additionally depends on the eluder dimension of the model class. We complement this with a lower bound that demonstrates that our results are tight. We then extend our selective sampling algorithm to IL with general function approximation and provide bounds on both the regret and the number of queries made to the noisy expert. A key novelty here is that our regret and query complexity bounds depend only on the number of times the optimal policy (and not the noisy expert or the learner) goes to states that have a small margin.
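
The selective-sampling loop described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the per-action linear model trained by SGD is only a stand-in for an online regression oracle, and the fixed `threshold` on the gap between the two highest scores is an assumed simplification of a data-dependent query rule. All class, function, and parameter names are hypothetical.

```python
import random


class OnlineSquareLossRegressor:
    """Toy stand-in for an online regression oracle (assumption):
    one linear model per action, updated by SGD on the square loss."""

    def __init__(self, dim, num_actions, lr=0.1):
        self.w = [[0.0] * dim for _ in range(num_actions)]
        self.lr = lr

    def predict(self, x):
        # One score per action: the model's estimate that the action is the expert's label.
        return [sum(wi * xi for wi, xi in zip(w, x)) for w in self.w]

    def update(self, x, label):
        # Regress each action's score toward the 0/1 indicator of the queried label.
        for a, w in enumerate(self.w):
            target = 1.0 if a == label else 0.0
            err = sum(wi * xi for wi, xi in zip(w, x)) - target
            for i, xi in enumerate(x):
                w[i] -= self.lr * err * xi


def selective_sampling(stream, expert, oracle, threshold):
    """Play the top-scoring action each round; query the (possibly noisy)
    expert only when the gap between the top two scores is at most `threshold`."""
    actions, queries = [], 0
    for x in stream:
        scores = oracle.predict(x)
        ranked = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        actions.append(best)
        if scores[best] - scores[runner_up] <= threshold:
            oracle.update(x, expert(x))  # the label is observed only on queried rounds
            queries += 1
    return actions, queries


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_expert(x):
        # Hypothetical expert: action 1 if the second feature is positive, flipped 10% of the time.
        label = 1 if x[1] > 0 else 0
        return 1 - label if rng.random() < 0.1 else label

    contexts = [(1.0, rng.uniform(-1.0, 1.0)) for _ in range(2000)]
    oracle = OnlineSquareLossRegressor(dim=2, num_actions=2)
    _, num_queries = selective_sampling(contexts, noisy_expert, oracle, threshold=0.2)
    print(f"queried the expert on {num_queries} of {len(contexts)} rounds")
```

In this sketch, an infinite threshold queries on every round and a negative threshold never queries; intermediate values spend queries only on contexts where the top two actions are hard to separate.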
