Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games

arXiv comments: Compared to the previous version, this version includes new results on harmonic games and extensive-form games. Abstract abridged due to arXiv length constraints.

A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. Although a convergence rate of $T^{-1}$ has long been established, the most effective paradigm in practice is counterfactual regret minimization (CFR), which is based on regret matching and its modern variants. In particular, the state of the art across most benchmarks is predictive regret matching$^+$ (PRM$^+$). Yet, such algorithms can exhibit slower $T^{-1/2}$ convergence even in self-play. In this paper, we close the gap between theory and practice. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ (i.e., optimal) average-iterate convergence guarantees, while also performing on par with, or even better than, PRM$^+$ on benchmark games. From a technical standpoint, we draw an analogy between (IREG-)PRM$^+$ and optimistic gradient descent with an adaptive learning rate. Reflecting this theoretical bridge, we find that the adaptive version of optimistic gradient descent we consider performs on par with IREG-PRM$^+$. This demystifies the effectiveness of the regret matching family vis-à-vis more standard optimization techniques. Moreover, we extend our analysis beyond zero-sum games to a family of variational inequality problems that includes harmonic games, as well as extensive-form games with fully-mixed equilibria, via a new and intriguing connection between CFR and harmonic games. Unlike prior work on harmonic games, our algorithms do not require knowing the underlying weights, by virtue of scale invariance. Under the weighted Minty condition, we show that any algorithm satisfying a scale-invariant RVU property (such as IREG-PRM$^+$) has constant regret (in self-play) and $T^{-1/2}$ iterate convergence.
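For readers less familiar with the regret matching family, the following is a minimal sketch of standard predictive regret matching$^+$ run in self-play on a small zero-sum matrix game, with the previous round's regret used as the prediction. It is not IREG-PRM$^+$, whose update rule the abstract does not state. The payoff matrix, the uniform fallback strategy, and the function names (`regret_vector`, `next_strategy`, `solve_matrix_game`, `duality_gap`) are illustrative assumptions.

```python
def regret_vector(loss, x):
    """Instantaneous regret of each action: <loss, x> - loss[i]."""
    expected = sum(l * p for l, p in zip(loss, x))
    return [expected - l for l in loss]


def next_strategy(z, predicted):
    """Normalize the positive part of accumulated plus predicted regret."""
    theta = [max(zi + pi, 0.0) for zi, pi in zip(z, predicted)]
    total = sum(theta)
    if total > 0.0:
        return [t / total for t in theta]
    return [1.0 / len(z)] * len(z)  # assumption: fall back to uniform


def solve_matrix_game(A, iterations):
    """Self-play where x minimizes and y maximizes x^T A y.

    Returns the average strategies of both players.
    """
    n, m = len(A), len(A[0])
    zx, zy = [0.0] * n, [0.0] * m  # accumulated, thresholded regrets
    px, py = [0.0] * n, [0.0] * m  # predictions: last observed regrets
    sum_x, sum_y = [0.0] * n, [0.0] * m
    for _ in range(iterations):
        x = next_strategy(zx, px)
        y = next_strategy(zy, py)
        # Simultaneous updates: each player sees the other's current strategy.
        loss_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        loss_y = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        px = regret_vector(loss_x, x)
        py = regret_vector(loss_y, y)
        zx = [max(z + r, 0.0) for z, r in zip(zx, px)]
        zy = [max(z + r, 0.0) for z, r in zip(zy, py)]
        sum_x = [s + v for s, v in zip(sum_x, x)]
        sum_y = [s + v for s, v in zip(sum_y, y)]
    return [s / iterations for s in sum_x], [s / iterations for s in sum_y]


def duality_gap(A, x, y):
    """Best-response value against x minus best-response value against y."""
    n, m = len(A), len(A[0])
    best_for_y = max(sum(A[i][j] * x[i] for i in range(n)) for j in range(m))
    best_for_x = min(sum(A[i][j] * y[j] for j in range(m)) for i in range(n))
    return best_for_y - best_for_x


if __name__ == "__main__":
    A = [[1.0, -1.0], [-1.0, 0.5]]  # hypothetical 2x2 payoff matrix
    x_avg, y_avg = solve_matrix_game(A, 20000)
    print(duality_gap(A, x_avg, y_avg))
```

The duality gap of the averaged strategies is nonnegative and is the quantity to which the average-iterate rates in the abstract refer; in this sketch it should shrink toward zero as the number of iterations grows.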
