Efficient collaboration under the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency: reaching a unified consensus on agents' policies requires a large number of training samples. This divergence stems from the lack of adequate team-consensus guidance signals during credit assignment in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict a central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to match their actions with their neighbors' predictions. Through theoretical analysis, we establish the equivalence between RA-CTDE and CTDE, showing that CTDE's training process can be achieved using agents' individual targets. Building on this insight, we introduce a novel method for combining intrinsic rewards with CTDE. Extensive experiments on challenging tasks in the SMAC and GRF benchmarks demonstrate the improved performance of our method.
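To make the intrinsic-reward idea concrete, below is a minimal sketch of how a consensus-based intrinsic reward could be computed from neighbors' action-model predictions. The abstract does not give the exact formula, so the function name `intrinsic_reward`, the mean-predicted-probability form of the reward, and the scaling coefficient in the usage example are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np

def intrinsic_reward(neighbor_predictions, action_taken):
    """Hypothetical cooperative intrinsic reward (a sketch, not the
    paper's definition): the mean probability that the central agent's
    neighbors assigned to the action it actually executed.

    neighbor_predictions: (num_neighbors, num_actions) array; each row
        is one neighbor's action-model prediction, i.e. a distribution
        over the central agent's discrete actions.
    action_taken: int index of the action the central agent executed.
    """
    probs = np.asarray(neighbor_predictions)
    # The reward is high when the executed action agrees with what the
    # neighbors expected, nudging agents toward a shared action tendency.
    return float(probs[:, action_taken].mean())

# Toy usage: 3 neighbors predicting over 4 discrete actions.
preds = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
])
r_int = intrinsic_reward(preds, action_taken=1)   # -> 0.7
# Assumed reward-additive combination with the environment reward,
# with an illustrative scaling coefficient of 0.1.
total_reward = 1.0 + 0.1 * r_int
```

Under this reading, the reward peaks when the executed action is exactly what the neighbors' action models predicted, which is one natural way to operationalize "matching actions with neighbors' predictions" in a reward-additive framework.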