We study the problem of learning the utility functions of no-regret learning agents in a repeated normal-form game. Unlike most prior literature, we introduce a principal with the power to observe the agents playing the game, send the agents signals, and pay the agents as a function of their actions. We show that the principal can, using a number of rounds polynomial in the size of the game, learn the utility functions of all agents to any desired precision $\varepsilon > 0$, regardless of which no-regret learning algorithms the agents use. Our main technique is to formulate a zero-sum game between the principal and the agents, in which the principal chooses strategies from the set of all payment functions so as to minimize the agents' payoffs. Finally, we discuss implications for the problem of steering agents. Using our utility-learning algorithm as a subroutine, we introduce the first algorithm for steering arbitrary no-regret learning agents to a desired equilibrium without prior knowledge of their utility functions.
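As a toy illustration of the mechanism the abstract describes, the sketch below simulates a single no-regret learner (here, Hedge / multiplicative weights, one concrete no-regret algorithm) whose per-round utility is shifted by the principal's payments, and recovers a hidden utility gap by searching for the payment at which the learner's play flips. This is only a minimal sketch under assumed details: the function names (`hedge_play`, `estimate_utility_gap`), the two-action setting, and the binary-search scheme are illustrative assumptions, not the paper's actual polynomial-round algorithm, which is derived from the zero-sum game formulation.

```python
import numpy as np

def hedge_play(u, payments, rounds=2000, eta=0.1, seed=0):
    """Simulate a Hedge (multiplicative-weights) learner over len(u) actions.

    Each round the learner receives utility u[a] + payments[a] for action a
    and observes the full utility vector. Returns empirical action frequencies.
    """
    rng = np.random.default_rng(seed)
    n = len(u)
    log_weights = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(rounds):
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        a = rng.choice(n, p=probs)
        counts[a] += 1
        # Full-information Hedge update on payment-augmented utilities.
        log_weights += eta * (u + payments)
    return counts / rounds

def estimate_utility_gap(u, lo=-1.0, hi=1.0, tol=1e-3):
    """Binary-search the payment t on action 0 at which the learner becomes
    indifferent between actions 0 and 1; at that point u[1] - u[0] is roughly t.
    (Illustrative scheme only; the paper's algorithm differs.)
    """
    u = np.array(u, dtype=float)
    while hi - lo > tol:
        t = (lo + hi) / 2
        freq = hedge_play(u, np.array([t, 0.0]))
        if freq[0] < 0.5:
            lo = t  # learner still prefers action 1: pay more for action 0
        else:
            hi = t  # learner now prefers action 0: pay less
    return (lo + hi) / 2

if __name__ == "__main__":
    hidden_u = [0.2, 0.5]  # unknown to the principal; true gap is 0.3
    print("estimated gap:", estimate_utility_gap(hidden_u))
```

Because any no-regret learner must eventually favor whichever action has higher payment-augmented utility, the sign of the learner's empirical preference reveals the sign of $u(a_1) - u(a_0) - t$, which is what makes payment-based probing informative; the paper's contribution is achieving this for all agents simultaneously, to precision $\varepsilon$, in polynomially many rounds.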