Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation

We propose a new model, the independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that scale only polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works on Markov games with function approximation have sample complexity bounds that scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, and this size is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle the non-stationarity incurred by multiple agents and by the use of function approximation; (2) separating the learning of Markov equilibria from exploration in Markov games, which allows us to use a full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, where $\epsilon$ is the desired accuracy. This improves on the state-of-the-art result of $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. (2022) and also significantly improves the dependence on other problem parameters.
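The contrast between per-agent and joint-action complexity can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the paper's algorithm: it uses a two-player game at a single fixed state, a hand-written joint value table, and the standard Hedge (multiplicative weights) update as one example of a full-information no-regret learner. The function names `marginalize_q`, `hedge_update`, and `joint_action_count` are hypothetical and chosen only for this example.

```python
import math
from typing import List


def marginalize_q(q_joint: List[List[float]], pi_other: List[float]) -> List[float]:
    """Average a two-player joint Q-table over the other player's policy.

    q_joint[a][b] is the value to agent 1 of joint action (a, b) at one fixed
    state; pi_other[b] is the probability that agent 2 plays action b.
    The result has one entry per action of agent 1 only.
    """
    return [sum(p * q for p, q in zip(pi_other, row)) for row in q_joint]


def hedge_update(policy: List[float], values: List[float], eta: float) -> List[float]:
    """One Hedge step: reweight every action by exp(eta * value), then normalize.

    This is full-information feedback: the value of every own action is
    observed, not only the value of the action that was played.
    """
    weights = [p * math.exp(eta * v) for p, v in zip(policy, values)]
    total = sum(weights)
    return [w / total for w in weights]


def joint_action_count(action_counts: List[int]) -> int:
    """Size of the joint action space: the product of per-agent action counts."""
    return math.prod(action_counts)


# Illustrative numbers (assumptions, not taken from the paper).
q_joint = [[1.0, 0.0],
           [0.0, 1.0]]          # agent 1 is rewarded for matching agent 2
pi_other = [0.75, 0.25]         # agent 2's current policy at this state

q_marginal = marginalize_q(q_joint, pi_other)
policy = hedge_update([0.5, 0.5], q_marginal, eta=1.0)

# With 10 agents and 2 actions each, the joint table has 2**10 entries,
# while the per-agent marginalized tables have 2 entries each (20 in total).
joint_size = joint_action_count([2] * 10)
per_agent_total = sum([2] * 10)
```

In this sketch each agent only ever handles a vector indexed by its own actions, which is the quantity the independent function approximation is meant to represent; the joint table appears here solely to construct the toy example.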
