Multi-agent deep reinforcement learning makes optimal decisions that depend on the system states observed by agents, but any uncertainty in the observations may mislead agents into taking wrong actions. Mean-Field Actor-Critic reinforcement learning (MFAC) is well known in the multi-agent field because it handles the scalability problem effectively. However, it is sensitive to state perturbations, which can significantly degrade team rewards. This work proposes Robust Mean-Field Actor-Critic reinforcement learning (RoMFAC), which has two innovations: 1) a new objective function for training actors, composed of a \emph{policy gradient function} related to the expected cumulative discounted reward on sampled clean states and an \emph{action loss function} that represents the difference between the actions taken on clean and adversarial states; and 2) a repetitive regularization of the action loss, which ensures that the trained actors perform well. Furthermore, this work proposes a game model named the State-Adversarial Stochastic Game (SASG). Although a Nash equilibrium of SASG may not exist, adversarial perturbations to states in RoMFAC are proven to be defensible based on SASG. Experimental results show that RoMFAC is robust against adversarial perturbations while remaining competitive in environments without perturbations.
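As an illustration of how an actor objective of this two-part kind could be assembled, the following is a minimal sketch and not the paper's implementation. The abstract does not give the exact formulas, so every name here is an assumption: `actor_loss`, the weight `kappa`, the use of an advantage-weighted log-probability as the policy gradient term, and the use of a KL divergence between the action distributions on clean and adversarial states as the action loss are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def actor_loss(clean_logits, adv_logits, actions, advantages, kappa):
    """Hypothetical combined actor loss, to be minimized.

    clean_logits: per-sample action logits computed on clean states
    adv_logits:   per-sample action logits computed on perturbed states
    actions:      index of the action taken in each sample
    advantages:   advantage estimate for each sample (assumed given)
    kappa:        assumed weight on the action loss
    Returns (total, policy_gradient_term, action_loss_term).
    """
    n = len(actions)
    pg_loss = 0.0
    action_loss = 0.0
    for lc, la, a, adv in zip(clean_logits, adv_logits, actions, advantages):
        p_clean = softmax(lc)
        p_adv = softmax(la)
        # Policy gradient surrogate, evaluated on the clean state only.
        pg_loss -= math.log(p_clean[a]) * adv / n
        # Assumed action loss: divergence between clean and adversarial policies.
        action_loss += kl_divergence(p_clean, p_adv) / n
    return pg_loss + kappa * action_loss, pg_loss, action_loss
```

In this sketch the action loss vanishes when the policy outputs the same distribution on clean and perturbed states, so `kappa` only controls how strongly deviations under perturbation are penalized relative to the reward-driven term. How the paper schedules its repetitive regularization is not specified in the abstract and is not modeled here.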