We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for offline learning in KL-regularized zero-sum games, achieved entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, with a number of iterations linear in the sample size, it achieves the same fast $\widetilde{\mathcal{O}}(1/n)$ statistical rate as the minimax estimator.
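To make the notion of a KL-regularized best response concrete, the following is a minimal sketch in a toy matrix game, which is a simplification and not the paper's algorithm. It relies only on the standard closed form for maximizing $\langle p, u\rangle - \beta\,\mathrm{KL}(p \,\|\, \pi_{\mathrm{ref}})$ over the simplex, namely $p(a) \propto \pi_{\mathrm{ref}}(a)\exp(u(a)/\beta)$. The payoff matrix `A`, the regularization weight `beta`, and the function names are hypothetical choices made for illustration.

```python
import numpy as np


def kl_best_response(payoff, ref, beta):
    """Maximize <p, payoff> - beta * KL(p || ref) over the probability simplex.

    Closed form: p(a) is proportional to ref(a) * exp(payoff(a) / beta).
    Assumes ref is strictly positive so that log(ref) is finite.
    """
    logits = np.log(ref) + payoff / beta
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Hypothetical toy game: rock-paper-scissors, row player maximizes x^T A y.
# A is skew-symmetric (A.T == -A), the structure the abstract refers to.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
ref = np.ones(3) / 3          # assumed uniform reference policy
beta = 2.0                    # assumed regularization strength

y = np.array([0.6, 0.3, 0.1])             # some fixed opponent policy
x_br = kl_best_response(A @ y, ref, beta)  # row player's regularized best response
```

Because the best response is a softmax of the payoff vector scaled by `1 / beta`, it varies smoothly with the opponent's policy; this is the kind of smoothness the abstract refers to. In this toy game with a uniform reference, the best response to the uniform policy is again uniform, since `A @ ref` is the zero vector.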