Methods like multi-agent reinforcement learning struggle to scale with growing population size. Mean-field games (MFGs) are a game-theoretic approach that can circumvent this by finding a solution for an abstract infinite population, which can then be used as an approximate solution for the $N$-agent problem. However, classical mean-field algorithms usually only work under restrictive conditions. We take steps to address this by introducing networked communication to MFGs, in particular to settings that use a single, non-episodic run of $N$ decentralised agents to simulate the infinite population, as is likely to be most reasonable in real-world deployments. We prove that our architecture's sample guarantees lie between those of earlier theoretical algorithms for the centralised- and independent-learning architectures, varying depending on network structure and the number of communication rounds. However, the sample guarantees of the three theoretical algorithms do not actually result in practical convergence times. We therefore contribute practical enhancements to all three algorithms, allowing us to present their first empirical demonstrations. We then show that in practical settings where the theoretical hyperparameters are not observed, giving fewer loops and poorer estimation of the Q-function, our communication scheme still respects the earlier theoretical analysis: it considerably accelerates learning over the independent case, which hardly seems to learn at all, and often performs similarly to the centralised case, while removing the latter's restrictive assumption. We provide ablations and additional studies showing that our networked approach also has advantages over both alternatives in terms of robustness to update failures and to changes in population size.
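To make the communication scheme concrete, below is a minimal sketch, not the paper's exact algorithm, of one networked communication phase: $N$ decentralised agents each hold a local Q-function estimate, and over a fixed number of communication rounds each agent adopts the estimate of whichever neighbour (including itself) reports the highest locally evaluated score. The network topology (a ring), the `evaluate` proxy, and all variable names are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch of a networked communication phase among N agents.
import numpy as np

rng = np.random.default_rng(0)
N, S, A = 8, 5, 3                        # agents, states, actions (toy sizes)

# Assumed ring network: each agent communicates with its two neighbours.
adjacency = np.eye(N, dtype=bool)
for i in range(N):
    adjacency[i, (i + 1) % N] = adjacency[i, (i - 1) % N] = True

Q = rng.normal(size=(N, S, A))           # per-agent local Q-function estimates

def evaluate(q: np.ndarray) -> float:
    """Stand-in for an agent's locally estimated return; a hypothetical proxy."""
    return float(q.max(axis=1).mean())

n_comm_rounds = 3                        # more rounds / denser graphs spread
for _ in range(n_comm_rounds):           # estimates further per phase
    scores = np.array([evaluate(Q[i]) for i in range(N)])
    new_Q = Q.copy()
    for i in range(N):
        neighbours = np.flatnonzero(adjacency[i])
        best = neighbours[np.argmax(scores[neighbours])]
        new_Q[i] = Q[best]               # adopt the best neighbour's estimate
    Q = new_Q
```

Under this reading, a denser network or more communication rounds lets good estimates propagate further per phase, which is consistent with the abstract's claim that the sample guarantees interpolate between the independent and centralised architectures depending on network structure and the number of communication rounds.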