Scale-invariance in games has recently emerged as a widely desired property. Yet almost all fast convergence guarantees for learning in games require prior knowledge of the utility scale. To address this, we develop learning dynamics that achieve fast convergence while being both scale-free, requiring no prior information about utilities, and scale-invariant, remaining unchanged under positive rescaling of utilities. For two-player zero-sum games, we obtain scale-free and scale-invariant dynamics with external regret bounded by $\tilde{O}(A_{\mathrm{diff}})$, where $A_{\mathrm{diff}}$ is the payoff range, which implies an $\tilde{O}(A_{\mathrm{diff}} / T)$ convergence rate to Nash equilibrium after $T$ rounds. For multiplayer general-sum games with $n$ players and $m$ actions, we obtain scale-free and scale-invariant dynamics with swap regret bounded by $O(U_{\mathrm{max}} \log T)$, where $U_{\mathrm{max}}$ is the range of the utilities, ignoring the dependence on the number of players and actions. This yields an $O(U_{\mathrm{max}} \log T / T)$ convergence rate to correlated equilibrium. Our learning dynamics are based on optimistic follow-the-regularized-leader with an adaptive learning rate that incorporates the squared path length of the opponents' gradient vectors, together with a new stopping-time analysis that exploits negative terms in regret bounds without scale-dependent tuning. For general-sum games, scale-free learning is further enabled by a technique called doubling clipping, which clips observed gradients based on past observations.