Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision-making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and autoregressive text generation. The simplest approach to IL, behavior cloning (BC), is thought to incur sample complexity with an unfavorable quadratic dependence on the problem horizon, motivating a variety of online algorithms that attain improved linear horizon dependence under stronger assumptions on the data and the learner's access to the expert. We revisit the apparent gap between offline and online IL from a learning-theoretic perspective, with a focus on the realizable/well-specified setting with general policy classes up to and including deep neural networks. Through a new analysis of behavior cloning with the logarithmic loss, we show that it is possible to achieve horizon-independent sample complexity in offline IL whenever (i) the range of the cumulative payoffs is controlled, and (ii) an appropriate notion of supervised learning complexity for the policy class is controlled. Specializing our results to deterministic, stationary policies, we show that the gap between offline and online IL is smaller than previously thought: (i) it is possible to achieve linear dependence on horizon in offline IL under dense rewards (matching what was previously only known to be achievable in online IL); and (ii) without further assumptions on the policy class, online IL cannot improve over offline IL with the logarithmic loss, even in benign MDPs. We complement our theoretical results with experiments on standard RL tasks and autoregressive language generation to validate the practical relevance of our findings.
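For concreteness, the sketch below illustrates behavior cloning with the logarithmic loss: the learner fits a policy offline by minimizing the negative log-likelihood (cross-entropy, for discrete actions) of the expert's actions on the demonstration dataset, with no environment interaction or further expert queries. This is a minimal illustrative sketch, not the paper's implementation; the architecture, dataset names, and hyperparameters (`MLPPolicy`, `expert_states`, `expert_actions`, the learning rate) are assumptions made for the example.

```python
# Minimal sketch of behavior cloning (BC) with the logarithmic loss.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MLPPolicy(nn.Module):
    """Maps a state to logits (unnormalized log-probabilities) over a discrete action space."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)


def behavior_cloning(policy, expert_states, expert_actions, epochs=50, lr=1e-3):
    """Fit the policy by maximum likelihood on expert demonstrations.

    The logarithmic loss is the negative log-probability the policy assigns to the
    expert's action in each visited state, averaged over the demonstration dataset.
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    log_loss = nn.CrossEntropyLoss()  # cross-entropy = log-loss for discrete actions
    for _ in range(epochs):
        opt.zero_grad()
        logits = policy(expert_states)
        loss = log_loss(logits, expert_actions)
        loss.backward()
        opt.step()
    return policy


if __name__ == "__main__":
    # Synthetic stand-in data: BC is purely offline, so only logged expert
    # state-action pairs are needed.
    states = torch.randn(1024, 8)            # 1024 expert state visits, 8-dim states
    actions = torch.randint(0, 4, (1024,))   # corresponding expert actions (4 actions)
    policy = behavior_cloning(MLPPolicy(state_dim=8, num_actions=4), states, actions)
```

The key design choice highlighted by the abstract is the loss: fitting the policy under the logarithmic loss (rather than, e.g., the 0-1 or squared loss) is what enables the horizon-independent guarantees when the cumulative payoff range and the supervised learning complexity of the policy class are controlled.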