Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games. However, existing theoretical results assume some variant of a "population generative model", which lets the learning algorithm modify the population distribution arbitrarily, and this limits their applicability. Moreover, learning algorithms typically operate on an abstract simulator with a population rather than on the $N$-player game itself. In contrast, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory, without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean-field approximation. Departing from the literature, instead of working with the best-response map, we first show that a policy mirror ascent map can be used to construct a contractive operator with the Nash equilibrium as its fixed point. We then analyze single-path TD learning for $N$-agent games, proving sample complexity guarantees using only a sample path from the $N$-agent simulator, without a population generative model. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite-sample guarantees.
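To make the policy mirror ascent map mentioned above more concrete, the following is a minimal sketch of one textbook form of the update: a KL-proximal step with entropy regularization on a tabular policy. It is an illustration under stated assumptions, not the operator analyzed in the paper. The function name `mirror_ascent_step`, the step size `eta`, the regularization weight `tau`, and the fixed action-value table `q` are all hypothetical; in the setting described above, the action values would instead be estimated by TD learning from a single trajectory and would depend on the population.

```python
import numpy as np


def mirror_ascent_step(pi, q, eta, tau):
    """One KL-proximal policy mirror ascent step with entropy regularization.

    Assumed (textbook) form, solved per state s:
        maximize over p:  <q[s], p> + tau * H(p) - (1 / eta) * KL(p || pi[s])
    whose closed-form solution is
        p  proportional to  pi[s]^(1 / (1 + eta*tau)) * exp(eta * q[s] / (1 + eta*tau)).

    pi : (S, A) array, each row a strictly positive probability vector
    q  : (S, A) array of action values (held fixed here for illustration)
    """
    logits = (np.log(pi) + eta * q) / (1.0 + eta * tau)
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)


def softmax(x):
    """Row-wise softmax."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


# Iterate the map with q held fixed, starting from the uniform policy.
q = np.array([[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]])
eta, tau = 1.0, 0.5
pi = np.full(q.shape, 1.0 / q.shape[1])
for _ in range(200):
    pi = mirror_ascent_step(pi, q, eta, tau)
```

With `q` held fixed, each step shrinks the distance to the fixed point in log-policy space by a factor of `1 / (1 + eta * tau)`, so the iterates approach `softmax(q / tau)`. This toy case only illustrates why a regularized mirror ascent map can be contractive; it does not model the coupling between policy and population that the mean-field analysis handles.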