We study online learning problems in which the learner has extra knowledge about the adversary's behaviour, as in game-theoretic settings where opponents typically follow a no-external-regret learning algorithm. Under this assumption, we propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR), that exploit this extra knowledge while retaining the no-regret property in the worst case, where the extra information is inaccurate. Specifically, against a no-external-regret adversary, AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward regret}, whereas Prod-BR achieves $O(\sqrt{T})$ \emph{dynamic regret}. To the best of our knowledge, ours is the first algorithm to consider forward regret and achieve $O(1)$ regret against strategic adversaries. When playing zero-sum games with Accurate Multiplicative Weights Update (AMWU), a special case of AFTRL, we achieve \emph{last round convergence} to the Nash equilibrium. We also provide numerical experiments to support our theoretical results. In particular, we demonstrate that our methods achieve significantly better regret bounds and a faster rate of last round convergence than the state of the art (e.g., Multiplicative Weights Update (MWU) and its optimistic counterpart, OMWU).
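For readers unfamiliar with the MWU baseline named above, the following is a minimal sketch of standard MWU self-play in a two-player zero-sum game. It is not AMWU or AFTRL, whose update rules are not given in this abstract. The function names `mwu_self_play` and `exploitability`, the step size `eta`, the horizon `T`, the starting strategies, and the matching-pennies matrix are all illustrative assumptions. The sketch tracks both the time-averaged strategies and the last iterates. The averages of no-regret play are known to approach the Nash equilibrium, while the last iterates of plain MWU generally do not, which is why last round convergence is singled out as a separate property.

```python
import numpy as np

def mwu_self_play(A, x0, y0, T=2000, eta=0.1):
    """Standard MWU for both players (baseline only, not AMWU).

    The row player minimises x^T A y; the column player maximises it.
    Returns the time-averaged strategies and the last iterates.
    """
    x = np.asarray(x0, dtype=float)
    y = np.asarray(y0, dtype=float)
    x_sum = np.zeros_like(x)
    y_sum = np.zeros_like(y)
    for _ in range(T):
        x_sum += x
        y_sum += y
        loss_x = A @ y      # per-action loss of the row player
        gain_y = A.T @ x    # per-action gain of the column player
        # Simultaneous multiplicative updates followed by renormalisation.
        x = x * np.exp(-eta * loss_x)
        x /= x.sum()
        y = y * np.exp(eta * gain_y)
        y /= y.sum()
    return x_sum / T, y_sum / T, x, y

def exploitability(A, x, y):
    """Duality gap of (x, y); it is zero exactly at a Nash equilibrium."""
    return (x @ A).max() - (A @ y).min()

# Matching pennies (assumed example); its unique equilibrium is uniform play.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_avg, y_avg, x_last, y_last = mwu_self_play(A, x0=[0.7, 0.3], y0=[0.4, 0.6])
print("gap of averaged strategies:", exploitability(A, x_avg, y_avg))
print("gap of last iterates:      ", exploitability(A, x_last, y_last))
```

Comparing the two printed gaps gives a rough sense of the difference between average-iterate and last-iterate behaviour for this baseline under the assumed parameters.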