Offline multi-agent reinforcement learning in general-sum settings is challenged by the distribution shift between logged datasets and target equilibrium policies. While standard methods rely on manual pessimistic penalties, we demonstrate that KL regularization suffices to stabilize learning and achieve equilibrium recovery. We propose General-sum Anchored Nash Equilibrium (GANE), which recovers regularized Nash equilibria at an accelerated statistical rate of $\widetilde{O}(1/n)$. For computational tractability, we develop General-sum Anchored Mirror Descent (GAMD), an iterative algorithm converging to a Coarse Correlated Equilibrium at the standard rate of $\widetilde{O}(1/\sqrt{n}+1/T)$. These results establish KL regularization as a standalone mechanism for pessimism-free offline learning that achieves equivalent or accelerated rates in multi-player general-sum games.
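To make the anchoring idea concrete, the following is a minimal sketch of a KL-anchored mirror-descent update on a two-player general-sum matrix game. It is not the GAMD algorithm itself, whose details are not given in the abstract. The function names (`kl_anchored_step`, `kl_anchored_md`), the regularization weight `beta`, the step size `eta`, and the use of exact payoff matrices in place of estimates from a logged dataset are all assumptions made for illustration. Each step maximizes $\langle q, p\rangle - \beta\,\mathrm{KL}(p\,\|\,\pi_{\mathrm{ref}}) - \tfrac{1}{\eta}\mathrm{KL}(p\,\|\,\pi_t)$, which has a closed-form solution.

```python
import numpy as np


def kl_anchored_step(pi, ref, q, beta, eta):
    """One proximal step (illustrative, not the paper's update).

    Closed-form maximizer over the simplex of
        <q, p> - beta * KL(p || ref) - (1 / eta) * KL(p || pi),
    which satisfies
        log p = (eta * q + eta * beta * log ref + log pi) / (1 + eta * beta) + const.
    Assumes pi and ref have full support.
    """
    logits = (eta * q + eta * beta * np.log(ref) + np.log(pi)) / (1.0 + eta * beta)
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def kl_anchored_md(A, B, ref_x, ref_y, beta=0.5, eta=0.1, T=500):
    """Run T simultaneous anchored steps on a two-player matrix game.

    A[i, j] and B[i, j] are the row and column players' payoffs (both maximize).
    ref_x and ref_y are hypothetical anchor policies, e.g. behavior policies.
    Returns the final policies and the time-averaged joint distribution of play.
    """
    x, y = ref_x.astype(float), ref_y.astype(float)
    joint_avg = np.zeros(A.shape, dtype=float)
    for _ in range(T):
        joint_avg += np.outer(x, y) / T
        qx = A @ y    # row player's expected payoff for each action
        qy = B.T @ x  # column player's expected payoff for each action
        x = kl_anchored_step(x, ref_x, qx, beta, eta)
        y = kl_anchored_step(y, ref_y, qy, beta, eta)
    return x, y, joint_avg
```

In this sketch the fixed point of each player's update is $p \propto \pi_{\mathrm{ref}} \exp(q/\beta)$, so a larger `beta` keeps the learned policy closer to the anchor, while a smaller `beta` lets it approach an unregularized best response. The averaged joint distribution is returned because coarse correlated equilibria are defined over joint play rather than over individual policies.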