When learning a task as a team, some agents in Multi-Agent Reinforcement Learning (MARL) may fail to understand their true impact on the performance of the team. Such agents end up learning sub-optimal policies and exhibit undesired lazy behaviours. To investigate this problem, we start by formalising the use of temporal causality in MARL problems. We then show how causality can be used to penalise such lazy agents and improve their behaviours. By understanding how its local observations are causally related to the team reward, each agent in the team can adjust its individual credit based on whether it helped to cause the reward or not. We show empirically that using causality estimations in MARL improves not only the holistic performance of the team, but also the individual capabilities of each agent. We observe that the improvements are consistent across a set of different environments.