We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as of the observability assumption that has enabled quasi-polynomial-time and quasi-polynomial-sample single-agent RL with partial observations, for tractably solving POSGs. Motivated by the inefficiency of planning in the ground-truth model, we then propose to further \emph{approximate} the shared common information, yielding an approximate model of the POSG in which an approximate \emph{equilibrium} (of the original POSG) can be found in quasi-polynomial time under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are \emph{both} quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to the more challenging goal of finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, and establish concrete computational and sample complexities under several structural assumptions on the model. We hope our study could open up the possibilities of leveraging, and even designing, different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.
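To give intuition for the approximate-common-information idea, consider the simplest information structure: when all observations are fully shared, the common information reduces to a shared posterior belief over the hidden state, and rounding that belief to a finite grid yields a finite approximate model on which dynamic programming is tractable. The sketch below is purely illustrative and is not the paper's algorithm: the two-state instance, its transition/observation/reward tables, and the grid resolution are all made-up assumptions.

```python
import numpy as np

# Hypothetical 2-state, 2-observation instance with fully shared observations.
# The shared belief over the hidden state is the common-information statistic;
# we approximate it by rounding to multiples of 1/GRID, then run finite-horizon
# value iteration on the resulting finite approximate model.

n_states, n_obs = 2, 2
actions = [0, 1]                              # joint (team) actions
T = np.array([[[0.9, 0.1], [0.1, 0.9]],       # T[a, s, s']
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[0.8, 0.2], [0.3, 0.7]])        # O[s', o]
R = np.array([[1.0, 0.0], [0.0, 1.0]])        # R[a, s]

GRID = 10  # belief resolution of the approximate common-information model

def round_belief(b):
    """Project a belief onto the finite grid {0, 1/GRID, ..., 1}."""
    p = round(b[0] * GRID) / GRID
    return (p, 1.0 - p)

def update(b, a, o):
    """Bayes update of the shared belief; returns (rounded belief, P(o))."""
    pred = T[a].T @ np.array(b)               # predicted next-state distribution
    post = pred * O[:, o]
    z = post.sum()
    return (round_belief(post / z), z) if z > 0 else (None, 0.0)

def value_iteration(horizon=5, gamma=0.95):
    beliefs = [(k / GRID, 1 - k / GRID) for k in range(GRID + 1)]
    V = {b: 0.0 for b in beliefs}
    for _ in range(horizon):
        V_new = {}
        for b in beliefs:
            best = -np.inf
            for a in actions:
                r = sum(b[s] * R[a, s] for s in range(n_states))
                ev = 0.0
                for o in range(n_obs):
                    nb, z = update(b, a, o)
                    if z > 0:
                        ev += z * V[nb]
                best = max(best, r + gamma * ev)
            V_new[b] = best
        V = V_new
    return V

V = value_iteration()
```

Because the grid has only GRID + 1 belief points, the planning cost is polynomial in the grid size rather than exponential in the history length; the paper's construction plays an analogous role for general information-sharing structures, with the grid resolution controlling the equilibrium approximation error.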