We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM, which improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2})$, matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, situations where Behavior Cloning fails to learn.