Regret Minimization with Adaptive Opponents in Repeated Games

We study regret minimization in repeated games against \emph{adaptive} opponents, whose actions can depend on the history of play. The standard online-learning metric of \emph{external regret} is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the \emph{realized} and the \emph{best-in-hindsight} accumulated utility when all players can \emph{respond} to the history of play. Compared to existing regret notions in this setting, ours is native to repeated games: it allows stronger comparators and less constrained opponents, while preserving the possibility of reaching better equilibria when all players minimize it. We first identify necessary conditions for achieving {\tt RP-Regret} that is sublinear in time; these conditions concern the variation of the player's comparator strategies in the regret definition and the memory of both the comparator's and the opponents' strategies. We then study additional conditions and algorithms with provable guarantees for minimizing {\tt RP-Regret}, which is by definition \emph{non-convex} in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work on online non-convex learning; (ii) one that minimizes a convex, \emph{linearized} surrogate of {\tt RP-Regret} at each iteration; and (iii) one that directly minimizes {\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players run algorithms that minimize {\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.

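The claim that external regret fails to capture adaptivity can be illustrated with a small simulation. The sketch below is not the paper's {\tt RP-Regret}, whose formal definition is not given in the abstract. It is a simplified stand-in that compares external regret with a counterfactual regret against fixed-action comparators, in which the opponent is allowed to re-respond to the comparator's play. The payoff values, the memory-1 mirroring opponent, and all function names are assumptions made for illustration only.

```python
# Illustrative Stag-Hunt payoffs for the learner (assumed values, not from the paper).
STAG, HARE = 0, 1
U = {(STAG, STAG): 4, (STAG, HARE): 0, (HARE, STAG): 3, (HARE, HARE): 2}


def opponent(history):
    """Hypothetical memory-1 adaptive opponent: opens with Stag,
    then mirrors the learner's previous action."""
    return STAG if not history else history[-1]


def play(learner_actions):
    """Roll out the repeated game; return (learner utility, opponent's realized actions)."""
    history, opp_actions, total = [], [], 0
    for a in learner_actions:
        b = opponent(history)      # opponent reacts to the learner's past actions
        total += U[(a, b)]
        opp_actions.append(b)
        history.append(a)
    return total, opp_actions


def external_regret(learner_actions):
    """Best fixed action against the opponent's *realized* action sequence,
    i.e. the opponent is treated as if it would not have reacted."""
    realized, opp_actions = play(learner_actions)
    best_fixed = max(sum(U[(a, b)] for b in opp_actions) for a in (STAG, HARE))
    return best_fixed - realized


def counterfactual_regret(learner_actions):
    """Best fixed action when the game is replayed and the opponent
    responds to the comparator's own history."""
    realized, _ = play(learner_actions)
    horizon = len(learner_actions)
    best_fixed = max(play([a] * horizon)[0] for a in (STAG, HARE))
    return best_fixed - realized


T = 100
always_hare = [HARE] * T
print(external_regret(always_hare))        # 0
print(counterfactual_regret(always_hare))  # 199
```

Under these assumptions, a learner that always plays Hare has zero external regret, because the opponent's realized sequence is almost entirely Hare and Hare is the best reply to it. Once the opponent is allowed to respond to the comparator, always playing Stag would have sustained cooperation, so the counterfactual regret grows linearly in the horizon.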