Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, model-based MARL algorithms have not had their sample complexity fully investigated. In this paper, we address this fundamental question. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to $\epsilon$ error, and for finding $\epsilon$-NE policies given a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space and the action spaces of the two agents, respectively. By establishing a matching lower bound, we further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, that is, if it queries state transition samples without knowledge of the rewards. This is in contrast to the usual reward-aware setting, which has a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, and in which this model-based approach is near-optimal with a gap only in the $|A|,|B|$ dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (being less adaptive and suboptimal in $|A|,|B|$), a tradeoff that arises particularly in the multi-agent context.
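
To make the two rates concrete, the following minimal sketch evaluates the leading terms $|S||A||B|(1-\gamma)^{-3}\epsilon^{-2}$ and $|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2}$ for given problem sizes. It assumes all constants and logarithmic factors hidden by $\tilde O$ and $\tilde\Omega$ are dropped, so the numbers indicate scaling only, not actual sample counts. The function names are hypothetical and chosen for illustration.

```python
def model_based_upper_bound(n_states, n_a, n_b, gamma, eps):
    """Leading term |S||A||B| (1-gamma)^-3 eps^-2.

    Assumption: constants and log factors hidden by O-tilde are dropped.
    """
    return n_states * n_a * n_b / ((1.0 - gamma) ** 3 * eps ** 2)


def reward_aware_lower_bound(n_states, n_a, n_b, gamma, eps):
    """Leading term |S|(|A|+|B|) (1-gamma)^-3 eps^-2.

    Assumption: constants and log factors hidden by Omega-tilde are dropped.
    """
    return n_states * (n_a + n_b) / ((1.0 - gamma) ** 3 * eps ** 2)


def action_dependence_gap(n_a, n_b):
    """Ratio of the two leading terms: |A||B| / (|A| + |B|).

    The dependence on |S|, gamma, and eps cancels, so the gap between the
    upper bound and the reward-aware lower bound is only in |A| and |B|.
    """
    return (n_a * n_b) / (n_a + n_b)


if __name__ == "__main__":
    # Illustrative problem sizes only; not taken from the paper.
    n_states, n_a, n_b, gamma, eps = 10, 4, 4, 0.9, 0.1
    upper = model_based_upper_bound(n_states, n_a, n_b, gamma, eps)
    lower = reward_aware_lower_bound(n_states, n_a, n_b, gamma, eps)
    print(f"upper-bound leading term: {upper:.3g}")
    print(f"reward-aware lower-bound leading term: {lower:.3g}")
    print(f"gap in action dependence: {action_dependence_gap(n_a, n_b):.3g}")
```

The ratio of the two terms depends only on $|A|$ and $|B|$, which matches the statement that the gap in the reward-aware setting lies solely in the action-space dependence.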
