Cooperative multi-agent reinforcement learning (CMARL) has shown promise for many real-world applications. Previous work mainly focuses on improving coordination by solving MARL-specific challenges (e.g., non-stationarity, credit assignment, scalability), but ignores the policy perturbation issue that arises when the learned policies are tested in a different environment. This issue has been considered neither in problem formulation nor in efficient algorithm design. To address it, we first model the problem as a limited policy adversary Dec-POMDP (LPA-Dec-POMDP), in which some coordinators in a team may accidentally and unpredictably encounter a limited number of malicious action attacks, while the regular coordinators still strive for the intended goal. We then propose Robust Multi-Agent Coordination via Evolutionary Generation of Auxiliary Adversarial Attackers (ROMANCE), which exposes the policy to diverse and strong auxiliary adversarial attacks during training, thus achieving high robustness under various policy perturbations. Concretely, to keep the ego-system from overfitting to a specific attacker, we maintain a set of attackers that is optimized for both high attacking quality and behavior diversity. The quality objective is to minimize the ego-system's coordination effectiveness, and a novel diversity regularizer based on sparse actions diversifies behaviors among attackers. The ego-system is then paired with a population of attackers selected from the maintained attacker set and trained alternately against the constantly evolving attackers. Extensive experiments on multiple scenarios from SMAC indicate that ROMANCE provides comparable or better robustness and generalization ability than other baselines.
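The limited-attack setting described in the abstract can be illustrated with a minimal sketch. It is not the ROMANCE implementation: `BudgetedAttacker`, `rollout`, and `sample_attackers` are hypothetical names, the attacker policy is a plain function rather than a learned one, and the quality and diversity objectives are omitted entirely. The sketch only shows one way a per-episode attack budget could restrict how many joint actions an attacker overrides, and how attackers might be drawn from a maintained set before each ego-system training phase.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# An attack is (victim agent index, forced action); None means "no attack this step".
Attack = Optional[Tuple[int, int]]


@dataclass
class BudgetedAttacker:
    """Hypothetical attacker that overrides at most `budget` actions per episode."""
    policy: Callable[[int, List[int]], Attack]  # (timestep, joint action) -> attack
    budget: int
    used: int = 0

    def reset(self) -> None:
        # Call at the start of every episode to restore the full budget.
        self.used = 0

    def perturb(self, t: int, joint_action: List[int]) -> List[int]:
        # Once the budget is spent, the joint action passes through unchanged.
        if self.used >= self.budget:
            return list(joint_action)
        attack = self.policy(t, joint_action)
        if attack is None:
            return list(joint_action)
        victim, forced_action = attack
        perturbed = list(joint_action)
        perturbed[victim] = forced_action
        self.used += 1  # every issued attack consumes one unit of budget
        return perturbed


def rollout(ego_policy: Callable[[int], List[int]],
            attacker: BudgetedAttacker,
            horizon: int) -> List[Tuple[List[int], List[int]]]:
    """Return (intended, executed) joint actions for each timestep of one episode."""
    attacker.reset()
    trajectory = []
    for t in range(horizon):
        intended = ego_policy(t)
        executed = attacker.perturb(t, intended)
        trajectory.append((intended, executed))
    return trajectory


def sample_attackers(attacker_set: List[BudgetedAttacker], k: int,
                     rng: random.Random) -> List[BudgetedAttacker]:
    """Draw k distinct attackers from the maintained set (uniform sampling is an assumption)."""
    return rng.sample(attacker_set, k)
```

In an alternating scheme like the one the abstract outlines, the ego-system would be updated on rollouts against the sampled attackers, after which the attacker set itself would be updated; both update steps are left out here because the abstract does not specify them.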