We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving sublinear external regret is impossible without incurring an exponential dependence on the episode length $H$. They therefore turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even when the opponent follows a fixed policy, in which case $O(\sqrt{K})$ external regret is well known to be achievable, their guarantee remains the slower $O(K^{2/3})$ rate under a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min\{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes ($O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case) but also smoothly interpolate between them by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm of Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. We then show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, yielding our final results.
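As a sanity check on how the epoch-based bound yields the adaptive rate, note that if $C$ were known in advance, one could balance the two terms of the $O(\eta C + \sqrt{K/\eta})$ bound directly (an illustrative back-of-the-envelope calculation assuming this tuning of $\eta$ is feasible; the actual algorithm is parameter-free and does not know $C$, relying on adaptive restarts instead):
\[
\eta = \Theta\!\big((K/C^{2})^{1/3}\big)
\quad\Longrightarrow\quad
\eta C + \sqrt{K/\eta}
= \Theta\!\big(C^{1/3}K^{1/3}\big) + \Theta\!\big(C^{1/3}K^{1/3}\big)
= \Theta\!\big((CK)^{1/3}\big),
\]
recovering the $(CK)^{1/3}$ term of the final $O(\min\{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ bound.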