In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded in their posterior beliefs about the types of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove exists and weakly dominates the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix games, level-based foraging, and StarCraft II indicate that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks.
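As a rough illustration of the posterior-belief idea described in the abstract, the Python sketch below applies Bayes' rule to a single teammate whose type is either cooperative or Byzantine. It is not the paper's algorithm: the function name `update_type_belief`, the two likelihood inputs, and all numbers are hypothetical, and it assumes the observer already has a model of how each type would act.

```python
def update_type_belief(prior_adv, lik_coop, lik_adv):
    """Return P(teammate is Byzantine | observed action) via Bayes' rule.

    prior_adv: prior probability that the teammate is Byzantine.
    lik_coop:  probability of the observed action under an assumed
               cooperative policy (hypothetical model).
    lik_adv:   probability of the observed action under an assumed
               adversarial policy (hypothetical model).
    """
    numerator = prior_adv * lik_adv
    evidence = numerator + (1.0 - prior_adv) * lik_coop
    if evidence == 0.0:
        # The action is impossible under both models; keep the prior.
        return prior_adv
    return numerator / evidence


# Start from a small prior and observe three actions that are each
# more likely under the adversarial model than the cooperative one.
belief = 0.1
for lik_coop, lik_adv in [(0.1, 0.6), (0.1, 0.6), (0.2, 0.5)]:
    belief = update_type_belief(belief, lik_coop, lik_adv)
```

Each observation that is better explained by the adversarial model raises the belief that the teammate is Byzantine, so a policy conditioned on `belief` could reduce its reliance on that teammate over time. In practice such a belief may be represented implicitly rather than computed in closed form like this.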