Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations in interactive multi-agent domains. Despite existing guarantees on the performance of the learned policies, offline MA-IL lacks characterizations of how far the learned policies are from a Nash equilibrium. In this paper, we establish impossibility and hardness results for learning low-exploitability policies in general $n$-player Markov Games. We do so by exhibiting examples where even exact measure matching fails, and by proving a new hardness result for characterizing the Nash gap under a fixed measure-matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for dominant-strategy expert equilibria with Behavioral Cloning error $\varepsilon_{\text{BC}}$, we obtain a Nash imitation gap of $\mathcal{O}\!\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$, where $\gamma$ is the discount factor. We generalize this result via a new notion of best-response continuity, and argue that this property is implicitly encouraged by standard regularization techniques.
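For concreteness, here is a minimal sketch of the quantity the bound refers to, under the standard definition of exploitability; the symbols $\delta$, $V_i$, and $\hat{\pi}$ are illustrative and need not match the paper's notation. Given a learned joint policy $\hat{\pi}$, its Nash (imitation) gap is
\[
  \delta(\hat{\pi}) \;=\; \max_{i \in [n]} \, \max_{\pi_i'} \Big( V_i\big(\pi_i', \hat{\pi}_{-i}\big) - V_i\big(\hat{\pi}_i, \hat{\pi}_{-i}\big) \Big),
\]
where $V_i$ denotes player $i$'s expected discounted return and $\hat{\pi}_{-i}$ the policies of all other players. The result above then states that $\delta(\hat{\pi}) = \mathcal{O}\!\left(n\varepsilon_{\text{BC}}/(1-\gamma)^2\right)$ when the expert plays a dominant-strategy equilibrium.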