Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the sizes of the joint state and action spaces grow exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm, \texttt{SUBSAMPLE-MFQ} (\textbf{Subsample}-\textbf{M}ean-\textbf{F}ield-\textbf{Q}-learning), and a decentralized randomized policy for a system with $n$ agents. For $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We show that this learned policy converges to the optimal policy at a rate of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. We validate our method empirically on Gaussian squeeze and global exploration settings.
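To make the subsampling idea concrete, here is a minimal, hypothetical sketch (not the paper's actual algorithm): a learner that abstracts the $n$-agent state into a global component plus the empirical distribution of $k$ uniformly subsampled agents' local states, then runs tabular Q-learning over that abstraction. All names (`mean_field_state`, `SubsampledQLearner`) and hyperparameters are illustrative assumptions; the key point is that the abstract state space scales with $k$, not $n$.

```python
import random
from collections import defaultdict

def mean_field_state(global_state, local_states, k, rng):
    """Hypothetical abstraction: pair the global state with the empirical
    distribution of k uniformly subsampled agents' local states.
    The result is order-invariant, so its size depends on k, not on n."""
    sample = rng.sample(local_states, k)
    # Represent the empirical distribution as sorted (state, count) pairs.
    dist = tuple(sorted((s, sample.count(s)) for s in set(sample)))
    return (global_state, dist)

class SubsampledQLearner:
    """Illustrative tabular Q-learner over the subsampled mean-field state."""

    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.actions = actions
        self.alpha = alpha    # learning rate (assumed value)
        self.gamma = gamma    # discount factor (assumed value)
        self.Q = defaultdict(float)

    def update(self, s, a, r, s_next):
        # Standard Q-learning temporal-difference update.
        best_next = max(self.Q[(s_next, b)] for b in self.actions)
        self.Q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.Q[(s, a)])
```

Because the abstraction keeps only the empirical distribution of the $k$ sampled local states, the Q-table indexes a space polynomial in $k$ rather than exponential in $n$, which is the source of the polynomial-in-$k$ learning time claimed above.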