Learning algorithms are often used to make decisions in sequential decision-making environments. In multi-agent settings, the decisions of each agent can affect the utilities/losses of the other agents. Therefore, if an agent is good at anticipating the behavior of the other agents, in particular how they will make decisions in each round as a function of their experience thus far, it can judiciously choose its own decisions over the rounds of the interaction so as to influence the other agents to behave in a way that ultimately benefits its own utility. In this paper, we study repeated two-player games involving two types of agents: a learner, which employs an online learning algorithm to choose its strategy in each round; and an optimizer, which knows the learner's utility function and the learner's online learning algorithm. The optimizer wants to plan ahead to maximize its own utility, while taking the learner's behavior into account. We provide two results: a positive result for repeated zero-sum games and a negative result for repeated general-sum games. Our positive result is an algorithm for the optimizer that exactly maximizes its utility against a learner playing Replicator Dynamics, the continuous-time analogue of Multiplicative Weights Update (MWU). Building on this result, we give an algorithm for the optimizer against MWU, i.e.~in the discrete-time setting, which guarantees the optimizer an average utility higher than the value of the one-shot game. Our negative result shows that, unless P=NP, there is no Fully Polynomial Time Approximation Scheme (FPTAS) for maximizing the utility of an optimizer against a learner that best-responds to the history in each round. This still leaves open the question of whether there exists a polynomial-time algorithm that optimizes the utility up to an additive $o(T)$ term.
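For concreteness, here is a standard textbook form of the two learning dynamics named above; the notation ($x_t$ for the learner's mixed strategy, $u_t$ for its round-$t$ utility vector, $\eta$ for the step size) is chosen here for illustration and need not match the paper's:
\[
x_{t+1}(i) \;=\; \frac{x_t(i)\, e^{\eta u_t(i)}}{\sum_j x_t(j)\, e^{\eta u_t(j)}}
\quad\text{(MWU)},
\qquad\qquad
\dot{x}(i) \;=\; x(i)\,\bigl(u(i) - \langle x, u \rangle\bigr)
\quad\text{(Replicator Dynamics)},
\]
where Replicator Dynamics arises from the MWU update in the limit of vanishing step size $\eta \to 0$, which is the sense in which it is the continuous-time analogue referred to above.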