Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power in which the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In teams of people, it is often useful to explicitly consider how power is distributed so that no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can produce systems that are more robust to single-agent failure, adversarial attacks, and changes in the incentives of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective that balances task reward against power concentration. Under this objective, we show that there always exists an equilibrium in which every agent plays a power-regularized best response. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, yielding lower-power behavior than the task-reward-only baseline and avoiding catastrophic events when an agent in the system goes off-policy.
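To make the trade-off between task reward and power concrete, the following is a minimal sketch on a toy one-shot cooperative game. It is not the paper's formal definition of power, nor an implementation of SBPR or PRIM. The reward table, the choice to measure power as the gap between the co-player's best and worst action for the ego agent, and the names `REWARD`, `task_reward`, `power_over_ego`, and `regularized_best_response` are all assumptions made for illustration.

```python
# Hypothetical shared reward table: REWARD[ego_action][co_action].
# Ego action 0 pays well but only if the co-player cooperates;
# ego action 1 pays less but barely depends on the co-player.
REWARD = [
    [10.0, -20.0],
    [6.0, 4.0],
]


def task_reward(ego_action):
    """Ego reward when the co-player cooperates (picks its best action)."""
    return max(REWARD[ego_action])


def power_over_ego(ego_action):
    """Assumed power proxy: how much ego reward the co-player could
    destroy by switching from its best action to its worst one."""
    row = REWARD[ego_action]
    return max(row) - min(row)


def regularized_best_response(lam):
    """Ego action maximizing task reward minus lam times the power proxy."""
    scores = [
        task_reward(a) - lam * power_over_ego(a)
        for a in range(len(REWARD))
    ]
    return scores.index(max(scores))


if __name__ == "__main__":
    for lam in (0.0, 0.5):
        print(lam, regularized_best_response(lam))
```

With `lam = 0.0` the ego agent chooses the high-reward but fragile action 0; with `lam = 0.5` the penalty on the co-player's influence makes the more robust action 1 preferable, which is the kind of lower-power behavior the abstract describes.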