Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the "curse of multi-agents" (i.e., algorithmic performance degrading exponentially with the number of agents) was unavoidable, until several recent works resolved it (Daskalakis et al., 2023; Cui et al., 2023; Wang et al., 2023). However, when the state space is prohibitively large and (linear) function approximation is deployed, these works either suffer from a slower convergence rate of $O(T^{-1/4})$ or incur a polynomial dependency on the number of actions $A_{\max}$, a dependency that is avoidable in the single-agent setting even when the loss functions vary arbitrarily over time. This paper first refines the AVLPR framework of Wang et al. (2023) with the insight of designing a *data-dependent* (i.e., stochastic) pessimistic estimate of the sub-optimality gap, which allows a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximation, we propose novel *action-dependent bonuses* to cover occasionally extreme estimation errors. Combined with state-of-the-art techniques from the single-agent RL literature, this yields the first algorithm that simultaneously tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency.
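As a point of reference, a minimal sketch of what a *pessimistic* gap estimate means, in generic notation that is assumed here rather than taken from the paper: for each agent $m$ and joint policy $\pi$, the algorithm maintains an over-estimate $\widehat{\operatorname{Gap}}^m(\pi)$ that holds with high probability,
$$\widehat{\operatorname{Gap}}^m(\pi) \;\ge\; \operatorname{Gap}^m(\pi) \;:=\; \max_{\tilde{\pi}^m} V^m\big(\tilde{\pi}^m \times \pi^{-m}\big) - V^m(\pi),$$
where $V^m$ is agent $m$'s value and $\pi^{-m}$ denotes the other agents' policies, so that driving $\max_m \widehat{\operatorname{Gap}}^m$ to zero certifies an approximate (coarse correlated) equilibrium. "Data-dependent" indicates that the estimate is itself a random quantity computed from the collected trajectories, rather than a fixed worst-case confidence bound.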