When deployed in the world, a learning agent such as a recommender system or a chatbot often repeatedly interacts with another learning agent (such as a user) over time. In many such two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. To better understand such cases, we examine the learning dynamics of the two-agent system and the implications for each agent's objective. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks (such as Stackelberg equilibrium payoffs) result in worst-case linear regret for at least one player. To better capture these systems, we construct a relaxed regret benchmark that is tolerant to small learning errors by agents. We show that standard learning algorithms fail to provide sublinear regret even with respect to this relaxed benchmark, and we develop algorithms that achieve near-optimal $O(T^{2/3})$ regret for both players. We further design relaxed environments under which faster learning ($O(\sqrt{T})$) is possible. Altogether, our results take a step towards assessing how two-agent interactions in sequential and decentralized learning environments affect the utility of both agents.
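For concreteness, a standard (unrelaxed) benchmark of the kind referenced above can be written in the generic Stackelberg-regret form; this is a sketch of the usual formulation, not necessarily the exact definition adopted in this work:
\[
\mathrm{Reg}_1(T) \;=\; T \cdot u_1\bigl(x^\star, \mathrm{BR}_2(x^\star)\bigr) \;-\; \sum_{t=1}^{T} u_1(x_t, y_t),
\qquad
x^\star \in \operatorname*{arg\,max}_{x} \, u_1\bigl(x, \mathrm{BR}_2(x)\bigr),
\]
where $u_1$ is the leader's utility, $\mathrm{BR}_2(x)$ is the follower's exact best response to the leader's action $x$, and $(x_t, y_t)$ are the actions played in round $t$. Holding the leader to $u_1(x^\star, \mathrm{BR}_2(x^\star))$ presumes the follower best responds exactly; a relaxed benchmark of the type described above would instead tolerate the follower playing approximate best responses that reflect its own learning errors.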