The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents makes decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the long horizon and the large state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling these issues by designing a model-free self-play algorithm, \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)}, for two-player zero-sum Markov games, a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it outputs an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$ and sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, where $S$ is the number of states, $A$ and $B$ are the numbers of actions of the two players, and $H$ is the horizon length. It outperforms existing algorithms in terms of space complexity in the tabular case, and in terms of sample complexity in the long-horizon regime, i.e., when $\min\{A, B\}\ll H^2$. Second, ME-Nash-QL achieves the lowest computational complexity, $O(T\,\mathrm{poly}(AB))$, while preserving Markov policies, where $T$ is the number of samples. Third, ME-Nash-QL achieves the best burn-in cost, $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms require a burn-in cost of at least $O(S^3AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
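As a point of reference for the guarantee above, one common way to formalize an $\varepsilon$-approximate Nash policy in two-player zero-sum Markov games is via the duality gap; the notation $V_1^{\mu,\nu}(s_1)$ for the value of the policy pair $(\mu,\nu)$ at the initial state $s_1$ is our own shorthand here, not taken from the abstract:
$$
\max_{\mu'} V_1^{\mu',\nu}(s_1) \;-\; \min_{\nu'} V_1^{\mu,\nu'}(s_1) \;\le\; \varepsilon,
$$
i.e., neither player can improve its value by more than $\varepsilon$ through a unilateral deviation.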