On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring

A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with function approximation and normal-form games with bandit feedback. We focus on equilibrium computation, in which a centralized learning algorithm aims to compute an equilibrium by controlling multiple agents that interact with an unknown environment. Our main contributions are:

- We provide upper and lower bounds on the optimal sample complexity for multi-agent decision making based on a multi-agent generalization of the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. (2021) in the single-agent counterpart to our setting. Compared to the best results for the single-agent setting, our bounds have additional gaps. We show that no "reasonable" complexity measure can close these gaps, highlighting a striking separation between single and multiple agents.
- We show that characterizing the statistical complexity of multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making with hidden (unobserved) rewards, a framework that subsumes variants of the partial monitoring problem. As a consequence, we characterize the statistical complexity of hidden-reward interactive decision making to the best extent possible.

Building on this development, we provide several new structural results, including 1) conditions under which the statistical complexity of multi-agent decision making can be reduced to that of single-agent decision making, and 2) conditions under which the so-called curse of multiple agents can be avoided.
