We study offline learning in KL-regularized two-player zero-sum games, where policies are regularized toward a fixed reference policy through a KL penalty. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates in the sample size $n$. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields, to our knowledge, the first pessimism-free offline learning guarantee for KL-regularized games, with a fast $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound. We further propose an efficient self-play policy optimization algorithm that replaces exact equilibrium computation with iterative KL-regularized policy updates, and prove that its last iterate preserves the same pessimism-free statistical guarantee up to a controlled optimization error.
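To make the notion of a KL-regularized best response concrete, the following minimal sketch computes it in closed form for a single-state matrix game. It is an illustration under stated assumptions, not the algorithm analyzed in this work: the payoff matrix `A`, the reference policy `ref`, the opponent policy `nu`, and the regularization strength `beta` are all hypothetical. It uses only the standard fact that the maximizer of $\langle \mu, u \rangle - \beta\,\mathrm{KL}(\mu \,\|\, \mu_{\mathrm{ref}})$ over the simplex is $\mu(a) \propto \mu_{\mathrm{ref}}(a)\exp(u(a)/\beta)$, a softmax that varies smoothly with the payoff vector $u$.

```python
import numpy as np

def kl_best_response(payoff, ref, beta):
    """Closed-form maximizer of <mu, payoff> - beta * KL(mu || ref) over the simplex."""
    logits = np.log(ref) + payoff / beta
    logits = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl(p, q):
    """KL divergence between two strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 3x3 zero-sum game (rock-paper-scissors); the row player maximizes.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
ref = np.array([0.5, 0.3, 0.2])   # assumed reference policy for the row player
nu = np.array([0.2, 0.5, 0.3])    # assumed fixed opponent (column) policy
beta = 0.5                        # assumed regularization strength

# Row player's KL-regularized best response to nu.
mu = kl_best_response(A @ nu, ref, beta)
```

With a zero payoff vector the best response reduces to `ref`, and as `beta` decreases it concentrates on the actions with the highest payoff against `nu`.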