Mean-Field Control based Approximation of Multi-Agent Reinforcement Learning in Presence of a Non-decomposable Shared Global State

Mean Field Control (MFC) is a powerful approximation tool for solving large-scale Multi-Agent Reinforcement Learning (MARL) problems. However, the success of MFC relies on the presumption that, given the local states and actions of all the agents, the next (local) states of the agents evolve conditionally independently of each other. Here we demonstrate that MFC can still be applied as a good approximation tool even in a MARL setting where, in addition to their conditionally independently evolving local states, the agents share a common global state, which introduces a correlation between the state transition processes of individual agents. The global state is assumed to be non-decomposable, i.e., it cannot be expressed as a collection of local states of the agents. We compute the approximation error as $\mathcal{O}(e)$, where $e=\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|} +\sqrt{|\mathcal{U}|}\right]$. Here $N$ denotes the size of the agent population, and $|\mathcal{X}|$ and $|\mathcal{U}|$ respectively denote the sizes of the (local) state and action spaces of individual agents. The approximation error is found to be independent of the size of the shared global state space. We further demonstrate that, in the special case where the reward and state transition functions are independent of the action distribution of the population, the error can be improved to $e=\frac{\sqrt{|\mathcal{X}|}}{\sqrt{N}}$. Finally, we devise a Natural Policy Gradient based algorithm that solves the MFC problem with $\mathcal{O}(\epsilon^{-3})$ sample complexity and obtains a policy that is within $\mathcal{O}(\max\{e,\epsilon\})$ error of the optimal MARL policy for any $\epsilon>0$.
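To give a feel for how the error term $e$ scales, the following minimal Python sketch evaluates the two expressions stated above. The function name `mfc_error_term` and its parameters are hypothetical, introduced only for illustration; the constants hidden in the $\mathcal{O}(\cdot)$ notation are not included, so the returned value indicates the rate only, not an actual bound on the approximation error.

```python
import math


def mfc_error_term(n_agents, n_states, n_actions, action_dependent=True):
    """Evaluate e = (sqrt(|X|) + sqrt(|U|)) / sqrt(N), or sqrt(|X|) / sqrt(N)
    in the special case where the reward and transition functions do not
    depend on the action distribution of the population.

    Illustrative helper (not from the paper): it omits the constants hidden
    in the O(.) notation, so it shows the scaling only.
    """
    if n_agents <= 0 or n_states <= 0 or n_actions <= 0:
        raise ValueError("population and space sizes must be positive")
    numerator = math.sqrt(n_states)
    if action_dependent:
        # General case: the action space size also enters the error term.
        numerator += math.sqrt(n_actions)
    return numerator / math.sqrt(n_agents)


# Example with N = 100 agents, |X| = 4 local states, |U| = 9 actions.
general = mfc_error_term(100, 4, 9)
special = mfc_error_term(100, 4, 9, action_dependent=False)
print(general, special)
```

Note that the size of the shared global state space does not appear as an argument, mirroring the statement that the error is independent of it, and that quadrupling $N$ halves the term in both cases.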
