We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints, a new problem motivated by real-world applications where deploying new policies is costly and the number of policy updates must be minimized. For two-player zero-sum Markov games, we design a (policy) elimination-based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$ with a batch complexity of only $O(H+\log\log K)$. Here, $S$ denotes the number of states, $A$ and $B$ are the numbers of actions of the two players, $H$ is the horizon, and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound of $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with an $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and to reward-free MARL with near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.