In a multi-agent system (MAS), action semantics captures the distinct effects that an agent's actions have on other entities, and can be used to divide the agents of a physically heterogeneous MAS into groups. Previous multi-agent reinforcement learning (MARL) algorithms apply global parameter-sharing across different types of heterogeneous agents without carefully discriminating between their action semantics. This common implementation weakens cooperation and coordination between agents in complex situations. However, fully independent agent parameters dramatically increase the computational cost and training difficulty. To benefit from different action semantics while retaining a proper parameter-sharing structure, we introduce the Unified Action Space (UAS). The UAS is the union of all agents' actions with their different semantics. All agents first compute their unified representation in the UAS, and then generate their heterogeneous action policies using different available-action masks. To further improve the training of the extra UAS parameters, we introduce a Cross-Group Inverse (CGI) loss that predicts the policies of agents in other groups from trajectory information. As a universal method for the physically heterogeneous MARL problem, we integrate the UAS into both value-based and policy-based MARL algorithms, yielding two practical algorithms: U-QMIX and U-MAPPO. Experimental results in the SMAC environment demonstrate the effectiveness of both U-QMIX and U-MAPPO compared with several state-of-the-art MARL methods.
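The core UAS mechanism described above can be illustrated with a minimal sketch: every agent scores all actions in the union action space, and an available-action mask then restricts the distribution to the actions its physical type can actually execute. All names here (`masked_policy`, the example logits and mask) are hypothetical illustrations, not the paper's implementation.

```python
import math

def masked_policy(unified_logits, avail_mask):
    """Project unified-action-space scores onto one agent type's policy.

    unified_logits: scores over every action in the UAS (hypothetical).
    avail_mask: True where this agent type can execute the action.
    """
    # Assign a large negative logit to unavailable actions so they
    # receive (near-)zero probability after the softmax.
    masked = [l if m else -1e9 for l, m in zip(unified_logits, avail_mask)]
    mx = max(masked)
    exps = [math.exp(v - mx) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Example: a UAS of 5 actions; this agent type can only use actions 0, 1, 3.
logits = [0.5, 1.2, -0.3, 0.8, 2.0]
mask = [True, True, False, True, False]
policy = masked_policy(logits, mask)
```

Different masks over the same shared network thus let heterogeneous agent types share parameters while producing type-specific action distributions.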