In cooperative multi-agent reinforcement learning (MARL), where an agent coordinates with one or more teammates toward a shared goal, the agent may suffer from non-stationarity caused by changes in its teammates' policies. Prior work mainly addresses policy changes during the training phase or teammates that change across episodes, ignoring the fact that teammates' policies may change suddenly within an episode, which can lead to miscoordination and poor performance. We formulate the problem as an open Dec-POMDP, in which we control some agents to coordinate with uncontrolled teammates whose policies may change within a single episode. We then develop a new framework, fast teammates adaptation (Fastap), to address the problem. Concretely, we first train versatile teammate policies and assign them to different clusters via the Chinese Restaurant Process (CRP). Then, we train the controlled agent(s) to coordinate with the sampled uncontrolled teammates by capturing their identifications as context for fast adaptation. Finally, each agent uses its local information to anticipate the teammates' context and make decisions accordingly. This process proceeds alternately, leading to a robust policy that can adapt to any teammates during the decentralized execution phase. We show on multiple multi-agent benchmarks that Fastap outperforms multiple baselines in both stationary and non-stationary scenarios.
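To make the clustering step more concrete, the following is a minimal sketch of the standard CRP prior only, not the authors' implementation. Under the CRP, the (n+1)-th item joins an existing cluster k with probability n_k / (n + alpha) and opens a new cluster with probability alpha / (n + alpha), where n_k is the size of cluster k. The function name `crp_assign`, the concentration parameter `alpha`, and the `seed` argument are illustrative assumptions; how Fastap combines this prior with information about teammate behavior is not specified in the abstract and is not modeled here.

```python
import random


def crp_assign(num_items, alpha=1.0, seed=0):
    """Sample cluster labels for `num_items` items from a CRP prior.

    Hypothetical helper for illustration: it ignores any features of the
    items (e.g. teammate behavior) and uses only cluster sizes.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] is the cluster index assigned to item i
    counts = []  # counts[k] is the current number of items in cluster k
    for _ in range(num_items):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, counts
```

For example, `crp_assign(20, alpha=1.0)` returns one label per item together with the cluster sizes. Larger values of `alpha` tend to produce more clusters, so the number of clusters does not have to be fixed in advance.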