In this study, we explore the robustness of cooperative multi-agent reinforcement learning (c-MARL) against Byzantine failures, where any agent can enact arbitrary, worst-case actions due to malfunction or adversarial attack. To address the uncertainty that any agent can be adversarial, we propose a Bayesian Adversarial Robust Dec-POMDP (BARDec-POMDP) framework, which views Byzantine adversaries as nature-dictated types, represented by a separate transition. This allows agents to learn policies grounded in their posterior beliefs about the types of other agents, fostering collaboration with identified allies and minimizing vulnerability to adversarial manipulation. We define the optimal solution to the BARDec-POMDP as an ex post robust Bayesian Markov perfect equilibrium, which we prove exists and weakly dominates the equilibrium of previous robust MARL approaches. To realize this equilibrium, we put forward a two-timescale actor-critic algorithm with almost sure convergence under specific conditions. Experiments on matrix games, level-based foraging, and StarCraft II indicate that, even under worst-case perturbations, our method successfully acquires intricate micromanagement skills and adaptively aligns with allies, demonstrating resilience against non-oblivious adversaries, random allies, observation-based attacks, and transfer-based attacks.
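To make the notion of a posterior belief over agent types concrete, the following is a minimal sketch of a textbook Bayes-rule update for a binary type (ally or adversary). It is not the BARDec-POMDP algorithm or the two-timescale actor-critic method; the function name `update_type_belief`, the two action labels, and the likelihood tables are hypothetical and chosen only for illustration.

```python
def update_type_belief(prior_adversary, action, ally_policy, adversary_policy):
    """Return P(adversary | action) via Bayes' rule for a binary agent type.

    ally_policy and adversary_policy are assumed (hypothetical) action
    likelihoods: dicts mapping an action label to its probability under
    each type.
    """
    like_adv = adversary_policy[action]
    like_ally = ally_policy[action]
    evidence = prior_adversary * like_adv + (1.0 - prior_adversary) * like_ally
    if evidence == 0.0:
        # The action is impossible under both models; keep the prior.
        return prior_adversary
    return prior_adversary * like_adv / evidence


# Hypothetical likelihoods: allies mostly cooperate, adversaries mostly defect.
ally_policy = {"cooperate": 0.9, "defect": 0.1}
adversary_policy = {"cooperate": 0.2, "defect": 0.8}

belief = 0.5  # uninformed prior that the other agent is adversarial
for observed in ["defect", "defect", "cooperate"]:
    belief = update_type_belief(belief, observed, ally_policy, adversary_policy)

print(f"posterior P(adversary) = {belief:.4f}")
```

In this toy setting, two observed defections push the belief toward the adversarial type and a later cooperative action pulls it partway back. A policy conditioned on such a belief could then treat the other agent as an ally or an adversary accordingly.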