We study tradeoffs between reward and regret in repeated gameplay between two agents. To this end, we introduce a notion of $\textit{generalized equilibrium}$ which allows for asymmetric regret constraints and yields polytopes of feasible values for each agent and each pair of regret constraints. We show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents. As a central example, we highlight the case where one agent is no-swap-regret and the other's regret is unconstrained. We show that this captures an extension of $\textit{Stackelberg}$ equilibria with a matching optimal value, and that there exists a wide class of games where a player can significantly increase their utility by deviating from a no-swap-regret algorithm against a no-swap-regret learner (in fact, almost any game without pure Nash equilibria is of this form). Additionally, we use generalized equilibria to consider tradeoffs in terms of the opponent's algorithm choice. We give a tight characterization of the maximal reward obtainable against $\textit{some}$ no-regret learner, yet we also show a class of games in which this is bounded away from the value obtainable against the class of common "mean-based" no-regret algorithms. Finally, we consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown. Again we show tradeoffs depending on the opponent's learning algorithm: the Stackelberg strategy is learnable in exponential time with any no-regret agent (and in polynomial time with any no-$\textit{adaptive}$-regret agent) for any game where it is learnable via queries, and there are games where it is learnable in polynomial time against any no-swap-regret agent but requires exponential time against a mean-based no-regret agent.
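For readers less familiar with the regret notions contrasted above, the standard textbook definitions can be written as follows. The notation is an assumption made here for illustration and does not come from the abstract: $A$ is the learner's finite action set, $a_t$ and $b_t$ are the learner's and opponent's actions in round $t$, and $u(a, b)$ is the learner's utility.

```latex
% Notation assumed for illustration only (not taken from the abstract):
% A: learner's finite action set; a_t, b_t: learner's and opponent's actions in round t;
% u(a, b): learner's utility; T: number of rounds.
\begin{align*}
\mathrm{Reg}_T &= \max_{a^\ast \in A} \sum_{t=1}^{T} u(a^\ast, b_t) - \sum_{t=1}^{T} u(a_t, b_t)
  && \text{(external regret)} \\
\mathrm{SwapReg}_T &= \max_{\pi : A \to A} \sum_{t=1}^{T} u(\pi(a_t), b_t) - \sum_{t=1}^{T} u(a_t, b_t)
  && \text{(swap regret)}
\end{align*}
```

Under these definitions, an algorithm is called no-regret (respectively no-swap-regret) when $\mathrm{Reg}_T = o(T)$ (respectively $\mathrm{SwapReg}_T = o(T)$). Because constant maps $\pi$ are a special case of swap functions, $\mathrm{SwapReg}_T \ge \mathrm{Reg}_T$, so the no-swap-regret condition is the stronger of the two.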