We consider several questions about tradeoffs between reward and regret in repeated gameplay between two agents. To this end, we introduce a notion of {\it generalized equilibrium} that allows for asymmetric regret constraints and yields a polytope of feasible values for each agent and each pair of regret constraints. We show that any such equilibrium is reachable by a pair of algorithms that maintain their regret guarantees against arbitrary opponents. As a central example, we highlight the case where one agent has no swap regret and the other's regret is unconstrained. We show that this case captures an extension of {\it Stackelberg} equilibria with a matching optimal value, and that there is a wide class of games in which a player can significantly increase their utility by deviating from a no-swap-regret algorithm when playing against a no-swap-regret learner (in fact, almost any game without pure Nash equilibria is of this form). Additionally, we use generalized equilibria to study tradeoffs that depend on the opponent's choice of algorithm. We give a tight characterization of the maximal reward obtainable against {\it some} no-regret learner, yet we also exhibit a class of games in which this reward is bounded away from the value obtainable against the class of common ``mean-based'' no-regret algorithms. Finally, we consider the problem of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown. Here too the tradeoffs depend on the opponent's learning algorithm: for any game where the Stackelberg strategy is learnable via queries, it is learnable in exponential time with any no-regret agent (and in polynomial time with any no-{\it adaptive}-regret agent), and there are games where it is learnable in polynomial time against any no-swap-regret agent but requires exponential time against a mean-based no-regret agent.
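For orientation, the regret notions and the Stackelberg value referred to above can be written as follows. This is a sketch in standard textbook notation, not necessarily the notation used in the paper itself: the action sets $\mathcal{A}$ and $\mathcal{B}$, the utilities $u_A$ and $u_B$, the horizon $T$, and the played actions $a_t$, $b_t$ are all assumed names introduced here for illustration.
\[
\mathrm{Reg}(T) = \max_{a^\ast \in \mathcal{A}} \sum_{t=1}^{T} u_A(a^\ast, b_t) - \sum_{t=1}^{T} u_A(a_t, b_t),
\qquad
\mathrm{SwapReg}(T) = \max_{\pi : \mathcal{A} \to \mathcal{A}} \sum_{t=1}^{T} u_A(\pi(a_t), b_t) - \sum_{t=1}^{T} u_A(a_t, b_t).
\]
An algorithm is no-regret (respectively no-swap-regret) if the corresponding quantity is $o(T)$ against every sequence $b_1, \dots, b_T$. Taking $\pi$ to be a constant map shows that $\mathrm{SwapReg}(T) \ge \mathrm{Reg}(T)$, so the no-swap-regret condition is the stronger of the two. Under the common convention that the follower breaks ties in the leader's favor (an assumption here), the Stackelberg value of the leader $A$ is
\[
V_{\mathrm{Stack}} = \max_{x \in \Delta(\mathcal{A})} \; \max_{b \in \mathrm{BR}(x)} u_A(x, b),
\qquad
\mathrm{BR}(x) = \arg\max_{b \in \mathcal{B}} u_B(x, b),
\]
where $u_A(x, b)$ and $u_B(x, b)$ denote expected utilities when the leader's action is drawn from the mixed strategy $x$.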