We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework has remained open. We provide a decisive negative resolution to this problem, from both a computational and a statistical perspective. We show that:
- Under the widely believed assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant.
- When the game is unknown, no algorithm, regardless of computational efficiency, can achieve no-regret without observing a number of episodes that is exponential in the number of players.
Perhaps surprisingly, our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute a coarse correlated equilibrium that is sparse in the sense that it can be represented as a mixture of a small number of product policies. The crux of our approach is a novel application of aggregation techniques from online learning, whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero-sum normal-form games.
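To make the notion of a sparse coarse correlated equilibrium concrete, the following is a minimal sketch for the one-step special case of a two-player normal-form game, not the Markov game setting treated in the paper. It checks how much any player could gain by deviating to a single fixed action against a uniform mixture of a few product strategies. The function names (`expected_payoff`, `cce_gap`), the uniform mixture weights, and the example payoff matrices are all assumptions made for illustration; the paper's formal SparseCCE definition is not reproduced here.

```python
def expected_payoff(M, x, y):
    """Expected payoff x^T M y for mixed strategies x (rows) and y (columns)."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def cce_gap(A, B, products):
    """Largest gain any player obtains by deviating to one fixed action
    against the uniform mixture of the given product strategies.

    A, B: payoff matrices of the row and column player (hypothetical inputs).
    products: list of (x, y) pairs, each a product of mixed strategies.
    A value <= eps means the mixture is an eps-approximate CCE.
    """
    T = len(products)
    n_rows, n_cols = len(A), len(A[0])
    # Value each player receives when both follow the mixture.
    v1 = sum(expected_payoff(A, x, y) for x, y in products) / T
    v2 = sum(expected_payoff(B, x, y) for x, y in products) / T
    # Best fixed action of each player against the opponent's mixture components.
    best1 = max(
        sum(sum(A[a][j] * y[j] for j in range(n_cols)) for _, y in products) / T
        for a in range(n_rows)
    )
    best2 = max(
        sum(sum(x[i] * B[i][b] for i in range(n_rows)) for x, _ in products) / T
        for b in range(n_cols)
    )
    return max(best1 - v1, best2 - v2)


# Illustrative coordination game (assumed payoffs, not taken from the paper).
A = [[2.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]

# A mixture of two product strategies: both play action 0, or both play action 1.
sparse_mixture = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
# A single product strategy in which the players miscoordinate.
miscoordinated = [([1.0, 0.0], [0.0, 1.0])]

print(cce_gap(A, B, sparse_mixture))
print(cce_gap(A, B, miscoordinated))
```

In this toy game the two-component mixture has gap -0.5, so no fixed-action deviation helps either player and the mixture is a coarse correlated equilibrium with sparsity 2, while the miscoordinated product strategy has gap 1.0. In the Markov game setting the deviations range over policies rather than single actions, which this sketch does not capture.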