Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the "curse of multi-agents" (i.e., algorithmic performance degrading exponentially with the number of agents) was unavoidable, until several recent works (Daskalakis et al., 2023; Cui et al., 2023; Wang et al., 2023). While these works did resolve the curse of multi-agents, when the state spaces are prohibitively large and (linear) function approximation is deployed, they either had a slower convergence rate of $O(T^{-1/4})$ or incurred a polynomial dependency on the number of actions $A_{\max}$, which is avoidable in single-agent cases even when the loss functions can vary arbitrarily over time (Dai et al., 2023). This paper first refines the `AVLPR` framework by Wang et al. (2023), with the insight of *data-dependent* (i.e., stochastic) pessimistic estimation of the sub-optimality gap, allowing a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximations, we propose novel *action-dependent bonuses* to cover occasionally extreme estimation errors. With the help of state-of-the-art techniques from the single-agent RL literature, we give the first algorithm that simultaneously tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency.