Mean-field reinforcement learning (MF-RL) scales multi-agent RL to large populations by reducing each agent's dependence on others to a single summary statistic -- the mean action. However, this reduction requires every agent to act at every time step; when some agents are idle, the mean action is simply undefined. Addressing asynchrony therefore requires a different summary statistic -- one that remains defined regardless of which agents act. The population distribution $μ\in Δ(\mathcal{O})$ -- the fraction of agents at each observation -- satisfies this requirement: its dimension is independent of $N$, and under exchangeability it fully determines each agent's reward and transition. Existing MF-RL theory, however, is built on the mean action and does not extend to $μ$. We therefore construct the Temporal Mean Field (TMF) framework around the population distribution $μ$ from scratch, covering the full spectrum from fully synchronous to purely sequential decision-making within a single theory. We prove existence and uniqueness of TMF equilibria, establish an $O(1/\sqrt{N})$ finite-population approximation bound that holds regardless of how many agents act per step, and prove convergence of a policy gradient algorithm (TMF-PG) to the unique equilibrium. Experiments on a resource selection game and a dynamic queueing game confirm that TMF-PG achieves near-identical performance whether one agent or all $N$ act per step, with approximation error decaying at the predicted $O(1/\sqrt{N})$ rate.
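To make the contrast between the two summary statistics concrete, the following is a minimal sketch, not the paper's implementation. It assumes a finite observation space indexed by integers, marks an idle agent with `None`, and draws observations i.i.d. from a fixed distribution purely for illustration. All function names (`empirical_distribution`, `mean_action`, `average_l1_error`) are hypothetical. The sketch shows that the empirical population distribution is defined whether or not agents act, while a mean over all agents' actions is not, and that the sampling error of the empirical distribution shrinks as $N$ grows, in line with the $O(1/\sqrt{N})$ scaling the abstract refers to.

```python
import random
from collections import Counter


def empirical_distribution(observations, num_obs):
    """Fraction of agents at each observation index 0..num_obs-1.

    Depends only on where agents are, not on which of them act.
    """
    counts = Counter(observations)
    n = len(observations)
    return [counts.get(o, 0) / n for o in range(num_obs)]


def mean_action(actions):
    """Mean action over all agents; None marks an idle agent (assumed convention).

    Returns None when any agent is idle, mirroring the point that the
    mean action is undefined unless every agent acts.
    """
    if any(a is None for a in actions):
        return None
    return sum(actions) / len(actions)


def average_l1_error(n_agents, probs, trials=200, seed=0):
    """Average L1 distance between the empirical distribution and `probs`.

    Observations are sampled i.i.d. from `probs` (an illustrative assumption,
    not the dynamics of any game in the paper).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        obs = rng.choices(range(len(probs)), weights=probs, k=n_agents)
        mu_hat = empirical_distribution(obs, len(probs))
        total += sum(abs(m - p) for m, p in zip(mu_hat, probs))
    return total / trials


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.2]
    for n in (100, 400, 1600):
        # Quadrupling N should roughly halve the error under 1/sqrt(N) scaling.
        print(n, round(average_l1_error(n, probs), 4))
```

Running the script prints the average error for three population sizes. Under the i.i.d. assumption, each fourfold increase in $N$ should roughly halve the error, which is the behaviour a $1/\sqrt{N}$ rate predicts.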