We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we eliminate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using a single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can be used as a warm start for any agent learning in a multi-agent non-cooperative setting. We adopt a two-timescale approach in which an online fixed-point recursion for the mean field operates on a slower timescale, in tandem with a control policy update for the generic agent on a faster timescale. Under the assumption that the agent's underlying Markov Decision Process (MDP) is communicating, we provide finite-sample guarantees for the convergence of the mean field and the control policy to the MFE. The sample complexity of the Sandbox Learning algorithm is $\tilde{\mathcal{O}}(\epsilon^{-4})$, where $\epsilon$ is the MFE approximation error. This is similar to that of works which assume access to a mean-field oracle. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
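To make the two-timescale idea concrete, the following is a minimal sketch, not the algorithm analyzed in this work. It assumes a small finite MDP with a randomly generated transition kernel, a hypothetical crowd-averse reward $r(s,\mu) = -\mu(s)$, tabular Q-learning as the fast control update, and an empirical-occupancy recursion as the slow mean-field update. The function name `sandbox_sketch`, the step-size exponents, and the exploration rate are all illustrative assumptions.

```python
import numpy as np


def sandbox_sketch(n_states=5, n_actions=3, n_steps=20000,
                   gamma=0.9, explore=0.1, seed=0):
    """Two-timescale sketch along a single sample path (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Hypothetical toy dynamics: P[s, a] is a distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    Q = np.zeros((n_states, n_actions))       # control estimate (fast)
    mu = np.full(n_states, 1.0 / n_states)    # mean-field estimate (slow)
    s = 0
    for t in range(n_steps):
        beta = (t + 1) ** -0.6    # larger step size: faster timescale
        alpha = (t + 1) ** -0.9   # smaller step size: slower timescale

        # Epsilon-greedy action from the current Q estimate.
        if rng.random() < explore:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))

        # Assumed reward: the agent is penalized for being in crowded states.
        r = -mu[s]
        s_next = int(rng.choice(n_states, p=P[s, a]))

        # Fast update: one Q-learning step against the current mean field.
        Q[s, a] += beta * (r + gamma * Q[s_next].max() - Q[s, a])

        # Slow update: move the mean field toward the visited state.
        # This is a convex combination, so mu stays on the simplex.
        e = np.zeros(n_states)
        e[s_next] = 1.0
        mu += alpha * (e - mu)

        s = s_next
    return Q, mu


if __name__ == "__main__":
    Q, mu = sandbox_sketch()
    print("mean-field estimate:", np.round(mu, 3))
    print("greedy policy:", Q.argmax(axis=1))
```

Because the mean-field step size decays faster than the Q-learning step size, the policy update sees an almost stationary mean field, while the mean field tracks the state distribution induced by the current policy. No oracle is queried at any point; both estimates are driven by the same trajectory.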