In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded in their posterior beliefs about the types of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove to exist and to weakly dominate the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix games, level-based foraging, and StarCraft II indicate that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks.
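To make the idea of a posterior belief over another agent's type concrete, the following is a minimal sketch of a single Bayes-rule update for a binary type (ally or adversary). It is an illustration only and not the algorithm proposed here: the function name `update_type_belief` and the assumption that per-type action likelihoods are available in closed form are hypothetical.

```python
def update_type_belief(prior_adversary, p_action_if_ally, p_action_if_adversary):
    """Return the posterior probability that a teammate is adversarial.

    Hypothetical helper: assumes the likelihood of the observed action under
    each type is known, which a learned policy would only approximate.
    """
    joint_adversary = prior_adversary * p_action_if_adversary
    joint_ally = (1.0 - prior_adversary) * p_action_if_ally
    evidence = joint_adversary + joint_ally
    if evidence == 0.0:
        # The observed action is impossible under both types; keep the prior.
        return prior_adversary
    return joint_adversary / evidence


# Repeatedly observing an action that is more likely under the adversarial
# type pushes the belief toward "adversary".
belief = 0.2
history = [belief]
for _ in range(3):
    belief = update_type_belief(belief, p_action_if_ally=0.1, p_action_if_adversary=0.6)
    history.append(belief)
```

In this sketch, a policy conditioned on `belief` could cooperate with a teammate while the belief is low and hedge against it as the belief grows, which is the qualitative behavior described above.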