In multi-agent reinforcement learning (MARL), agents repeatedly interact over time and revise their strategies as new data arrives, producing a sequence of strategy profiles. This paper studies sequences of strategies satisfying a pairwise constraint inspired by policy updating in reinforcement learning, where an agent who is best responding in period $t$ does not switch its strategy in the next period $t+1$. This constraint requires only that optimizing agents do not switch strategies; it does not constrain non-optimizing agents in any way, and thus allows for exploration. Sequences with this property are called satisficing paths, and they arise naturally in many MARL algorithms. A fundamental question about strategic dynamics is the following: for a given game and initial strategy profile, is it always possible to construct a satisficing path that terminates at an equilibrium strategy? The resolution of this question has implications for the capabilities or limitations of a class of MARL algorithms. We answer this question in the affirmative for mixed extensions of finite normal-form games.
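As an illustration of the pairwise constraint described above, the following minimal sketch checks whether a given sequence of mixed-strategy profiles is a satisficing path. It assumes a two-player game given by payoff matrices `A` (row player) and `B` (column player); the restriction to two players, the function names, and the numerical tolerance `tol` are assumptions made for this example and are not taken from the paper. It only verifies the constraint on a given sequence and does not construct a path to equilibrium.

```python
from typing import List, Sequence, Tuple

Strategy = Tuple[float, ...]          # mixed strategy: probabilities over actions
Profile = Tuple[Strategy, Strategy]   # (row player's strategy, column player's strategy)
Matrix = Sequence[Sequence[float]]    # payoff[i][j]: row action i, column action j


def action_values(payoff: Matrix, opponent: Strategy, row_player: bool) -> List[float]:
    """Expected payoff of each pure action against the opponent's mixed strategy."""
    if row_player:
        return [sum(v * q for v, q in zip(row, opponent)) for row in payoff]
    n_cols = len(payoff[0])
    return [
        sum(payoff[i][j] * opponent[i] for i in range(len(payoff)))
        for j in range(n_cols)
    ]


def is_best_responding(payoff: Matrix, own: Strategy, opponent: Strategy,
                       row_player: bool, tol: float = 1e-9) -> bool:
    """A mixed strategy is a best response iff it attains the best pure-action value."""
    values = action_values(payoff, opponent, row_player)
    achieved = sum(p * v for p, v in zip(own, values))
    return achieved >= max(values) - tol


def is_satisficing_path(A: Matrix, B: Matrix, path: Sequence[Profile],
                        tol: float = 1e-9) -> bool:
    """Check that any player best responding at step t keeps its strategy at t+1.

    Players who are not best responding are unconstrained and may switch freely.
    Strategies are compared by exact equality, which is enough for a sketch.
    """
    for (x, y), (x_next, y_next) in zip(path, path[1:]):
        if is_best_responding(A, x, y, True, tol) and x_next != x:
            return False
        if is_best_responding(B, y, x, False, tol) and y_next != y:
            return False
    return True


# Hypothetical 2x2 coordination game: both players are paid 1 when actions match.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
miscoordinated = ((1.0, 0.0), (0.0, 1.0))  # neither player is best responding
coordinated = ((1.0, 0.0), (1.0, 0.0))     # both players are best responding

print(is_satisficing_path(A, B, [miscoordinated, coordinated]))
print(is_satisficing_path(A, B, [coordinated, miscoordinated]))
```

In this example the first sequence satisfies the constraint, because neither player is best responding at the miscoordinated profile and both are therefore free to change. The reversed sequence violates it, because the column player is best responding at the coordinated profile and yet switches.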