Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.
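To make the role of the KL anchor concrete, the following is a minimal sketch of a generic KL-anchored mirror descent update on a two-player general-sum matrix game. It is not the paper's GAMD algorithm, whose exact form the abstract does not give. The function names, the payoff matrices `A` and `B`, the step size `eta`, the regularization weight `beta`, and the use of exact payoffs instead of estimates from a logged dataset are all illustrative assumptions. Each player takes a multiplicative-weights step on the objective "expected payoff minus `beta` times KL to a reference policy", which gives the update $\pi_{t+1} \propto \pi_t^{1-\eta\beta}\,\pi_{\mathrm{ref}}^{\eta\beta}\,\exp(\eta q)$.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def anchored_md_step(pi, pi_ref, q, eta, beta):
    """One KL-anchored mirror descent step (illustrative, hypothetical helper).

    pi      current policy (must have full support, since we take logs)
    pi_ref  reference (anchor) policy, also with full support
    q       payoff of each action against the opponent's current policy
    eta     step size; beta: weight of the KL(pi || pi_ref) penalty
    """
    logits = [
        (1.0 - eta * beta) * math.log(p) + eta * beta * math.log(r) + eta * v
        for p, r, v in zip(pi, pi_ref, q)
    ]
    m = max(logits)  # subtract the max for numerical stability
    return normalize([math.exp(l - m) for l in logits])


def run_anchored_md(A, B, ref_row, ref_col, eta=0.1, beta=1.0, steps=500):
    """Simultaneous anchored updates for both players of a bimatrix game.

    A[i][j] is the row player's payoff and B[i][j] the column player's payoff
    when the row player picks i and the column player picks j.
    Assumption: payoffs are known exactly; an offline method would instead
    estimate them from logged data.
    """
    row, col = list(ref_row), list(ref_col)
    n_rows, n_cols = len(row), len(col)
    for _ in range(steps):
        q_row = [sum(A[i][j] * col[j] for j in range(n_cols)) for i in range(n_rows)]
        q_col = [sum(B[i][j] * row[i] for i in range(n_rows)) for j in range(n_cols)]
        row, col = (
            anchored_md_step(row, ref_row, q_row, eta, beta),
            anchored_md_step(col, ref_col, q_col, eta, beta),
        )
    return row, col


# Example: a prisoner's-dilemma-style general-sum game, uniform anchors.
A = [[3.0, 0.0], [5.0, 1.0]]
B = [[3.0, 5.0], [0.0, 1.0]]
row_policy, col_policy = run_anchored_md(A, B, [0.5, 0.5], [0.5, 0.5])
```

When the payoff vector `q` is held fixed, the fixed point of this update is proportional to `pi_ref[a] * exp(q[a] / beta)`, so a larger `beta` keeps the learned policy closer to the anchor. This is the sense in which the KL term alone can limit how far the policy drifts from the data-collecting behavior.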