We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.
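To make the role of KL regularization in self-play concrete, the following is a minimal sketch on a much simpler problem than the one studied here: a single-state zero-sum matrix game with a known payoff matrix. It is not SOS-MD (there is no offline dataset and no least-squares value estimation). The reference policies `mu` and `nu`, the regularization strength `beta`, the step size `eta`, and all function names are assumptions introduced for illustration. Each player runs a mirror descent step on the objective $x^\top A y - \beta\,\mathrm{KL}(x\,\|\,\mu) + \beta\,\mathrm{KL}(y\,\|\,\nu)$, and the last iterate is checked against the fixed-point condition $x \propto \mu \exp(Ay/\beta)$, $y \propto \nu \exp(-A^\top x/\beta)$ of the regularized equilibrium.

```python
import math


def normalize(weights):
    """Rescale positive weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def payoff_max(A, y):
    """Expected payoff of each row action against column policy y: A y."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]


def payoff_min(A, x):
    """Expected payoff conceded by each column action against row policy x: A^T x."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def md_step(p, ref, grad, eta, beta):
    """One KL-regularized mirror descent step (assumed update rule):
    p_new(a) is proportional to p(a)^(1 - eta*beta) * ref(a)^(eta*beta) * exp(eta * grad(a)).
    Requires strictly positive p and ref, and eta * beta <= 1.
    """
    logits = [
        (1.0 - eta * beta) * math.log(pa) + eta * beta * math.log(ra) + eta * g
        for pa, ra, g in zip(p, ref, grad)
    ]
    shift = max(logits)  # subtract the max for numerical stability
    return normalize([math.exp(l - shift) for l in logits])


def self_play(A, mu, nu, beta=1.0, eta=0.1, T=2000):
    """Simultaneous self-play for T iterations, started at the reference policies."""
    x, y = list(mu), list(nu)
    for _ in range(T):
        gx = payoff_max(A, y)                 # max player ascends on A y
        gy = [-g for g in payoff_min(A, x)]   # min player ascends on -A^T x
        x, y = md_step(x, mu, gx, eta, beta), md_step(y, nu, gy, eta, beta)
    return x, y


def fixed_point_gap(A, x, y, mu, nu, beta):
    """Largest deviation of (x, y) from the regularized best responses."""
    bx = normalize([m * math.exp(g / beta) for m, g in zip(mu, payoff_max(A, y))])
    by = normalize([n * math.exp(-g / beta) for n, g in zip(nu, payoff_min(A, x))])
    return max(abs(a - b) for a, b in zip(x + y, bx + by))


if __name__ == "__main__":
    # Rock-paper-scissors payoffs for the max (row) player.
    A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
    mu, nu = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]
    x, y = self_play(A, mu, nu)
    print("x =", x)
    print("y =", y)
    print("fixed-point gap =", fixed_point_gap(A, x, y, mu, nu, beta=1.0))
```

In this toy setting the regularizer makes the game strongly monotone, so the last iterate converges for a small enough step size. With uniform reference policies on rock-paper-scissors the iterates stay at the uniform policy. With skewed references the equilibrium is pulled toward them.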