When deployed in the world, a learning agent such as a recommender system or a chatbot often repeatedly interacts with another learning agent (such as a user) over time. In many such two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. To better understand such cases, we examine the learning dynamics of the two-agent system and the implications for each agent's objective. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks (such as Stackelberg equilibrium payoffs) result in worst-case linear regret for at least one player. To better capture these systems, we construct a relaxed regret benchmark that is tolerant to small learning errors by agents. We show that standard learning algorithms fail to provide sublinear regret, and we develop algorithms to achieve near-optimal $O(T^{2/3})$ regret for both players with respect to these benchmarks. We further design relaxed environments under which faster learning ($O(\sqrt{T})$) is possible. Altogether, our results take a step towards assessing how two-agent interactions in sequential and decentralized learning environments affect the utility of both agents.
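To make the Stackelberg equilibrium payoff benchmark mentioned in the abstract concrete, here is a minimal sketch for a small finite game. It is not the paper's construction: the function names `pure_stackelberg` and `benchmark_regret`, the restriction to pure leader commitments, the tie-breaking in the leader's favour, and the example payoff matrices are all assumptions made for illustration.

```python
import numpy as np


def pure_stackelberg(leader_payoff, follower_payoff):
    """Pure-commitment Stackelberg outcome of a bimatrix game.

    Rows index leader actions, columns index follower actions.
    Returns (leader_action, follower_action, leader_value, follower_value).
    Assumption: the follower best-responds exactly and breaks ties
    in the leader's favour.
    """
    A = np.asarray(leader_payoff, dtype=float)
    B = np.asarray(follower_payoff, dtype=float)
    best = None
    for i in range(A.shape[0]):
        # Follower actions that maximise the follower's payoff against i.
        responses = np.flatnonzero(np.isclose(B[i], B[i].max()))
        # Among those, pick the one the leader prefers (tie-breaking assumption).
        j = int(responses[np.argmax(A[i, responses])])
        if best is None or A[i, j] > best[2]:
            best = (i, j, float(A[i, j]), float(B[i, j]))
    return best


def benchmark_regret(benchmark_value, realized_payoffs):
    """Cumulative regret of a payoff sequence against a fixed per-round benchmark."""
    payoffs = np.asarray(realized_payoffs, dtype=float)
    return float(benchmark_value * payoffs.size - payoffs.sum())


# Hypothetical 2x2 game: the leader's row 0 is dominant under simultaneous play,
# but committing to row 1 induces the follower to play column 1.
A = [[2, 4], [1, 3]]  # leader payoffs
B = [[1, 0], [0, 1]]  # follower payoffs

if __name__ == "__main__":
    i, j, leader_value, follower_value = pure_stackelberg(A, B)
    print(i, j, leader_value, follower_value)
    print(benchmark_regret(leader_value, [2, 2, 3, 3]))
```

In this toy game the leader earns 3 per round by committing to row 1, so a leader that collects payoffs 2, 2, 3, 3 over four rounds has regret 2 against this benchmark. The abstract's point is that when both agents are still learning, holding either player to such an exact-best-response benchmark can force linear regret, which motivates the relaxed benchmark.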