In multi-agent reinforcement learning (MARL), independent learners are those that do not observe the actions of other agents in the system. Due to the decentralization of information, it is challenging to design independent learners that drive play to equilibrium. This paper investigates the feasibility of using satisficing dynamics to guide independent learners to approximate equilibrium in stochastic games. For $\epsilon \geq 0$, an $\epsilon$-satisficing policy update rule is any rule that instructs the agent to not change its policy when it is $\epsilon$-best-responding to the policies of the remaining players; $\epsilon$-satisficing paths are defined to be sequences of joint policies obtained when each agent uses some $\epsilon$-satisficing policy update rule to select its next policy. We establish structural results on the existence of $\epsilon$-satisficing paths into $\epsilon$-equilibrium in both symmetric $N$-player games and general stochastic games with two players. We then present an independent learning algorithm for $N$-player symmetric games and give high probability guarantees of convergence to $\epsilon$-equilibrium under self-play. This guarantee is made using symmetry alone, leveraging the previously unexploited structure of $\epsilon$-satisficing paths.
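To make the definition of an $\epsilon$-satisficing policy update rule concrete, the following is a minimal sketch and not the learning algorithm presented in the paper. It assumes a hypothetical stateless, two-player, two-action symmetric coordination game with pure policies. The names `PAYOFF`, `is_eps_best_response`, `satisficing_update`, and `satisficing_path` are illustrative only. For simplicity the satisfaction check reads the other agent's action directly; a genuinely independent learner would have to estimate this from its own rewards.

```python
import random

# Hypothetical stateless 2-player, 2-action symmetric coordination game.
# PAYOFF[a_self][a_other] is the reward to the agent playing a_self.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]
ACTIONS = [0, 1]


def is_eps_best_response(own, other, eps):
    """Return True if `own` is within eps of the best payoff against `other`."""
    best = max(PAYOFF[a][other] for a in ACTIONS)
    return PAYOFF[own][other] >= best - eps


def satisficing_update(own, other, eps, rng):
    """An eps-satisficing rule: keep the policy when eps-best-responding.

    When the agent is not satisfied, the definition leaves the choice free;
    this sketch (an assumption) resamples uniformly at random.
    """
    if is_eps_best_response(own, other, eps):
        return own
    return rng.choice(ACTIONS)


def satisficing_path(start, eps, rng, max_steps=1000):
    """Generate joint policies until an eps-equilibrium or max_steps is reached."""
    path = [start]
    joint = start
    for _ in range(max_steps):
        a, b = joint
        if is_eps_best_response(a, b, eps) and is_eps_best_response(b, a, eps):
            break  # both agents are satisfied: eps-equilibrium
        joint = (satisficing_update(a, b, eps, rng),
                 satisficing_update(b, a, eps, rng))
        path.append(joint)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    print(satisficing_path((0, 1), eps=0.0, rng=rng))
```

Starting from the miscoordinated joint policy `(0, 1)` with `eps=0.0`, both agents are unsatisfied and keep resampling until they coordinate, at which point neither changes its policy again. With a sufficiently large `eps` (here `1.0`), every joint policy already counts as an $\epsilon$-equilibrium and the path has length one.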