Adaptive human-agent and agent-agent cooperation is becoming increasingly critical in multi-agent reinforcement learning (MARL), where remarkable progress has been made with the help of deep neural networks. However, many established algorithms perform well only in the learning paradigm and generalize poorly when cooperating with unseen partners. Personality theory in cognitive psychology holds that humans handle this cooperation challenge well by first predicting others' personalities and then their complex actions. Inspired by this two-step theory, we propose a biologically plausible mixture of personality (MoP) improved spiking actor network (SAN), in which a determinantal point process is used to simulate the complex formation and integration of different types of personality in MoP, and dynamic and spiking neurons are incorporated into the SAN for efficient reinforcement learning. The benchmark Overcooked task, which places strong demands on cooperative cooking, is selected to test the proposed MoP-SAN. The experimental results show that MoP-SAN achieves high performance not only in the learning paradigm but also in the generalization test paradigm (i.e., cooperation with other unseen agents), where most counterpart deep actor networks fail. Ablation experiments and visualization analyses were conducted to explain why MoP and SAN are effective in multi-agent reinforcement learning scenarios while deep neural networks (DNNs) perform poorly in the generalization test.
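To make the role of the determinantal point process (DPP) more concrete, the following is a minimal sketch of how an L-ensemble DPP assigns higher probability to diverse subsets than to redundant ones. It is not the MoP-SAN implementation: the 2-D "personality" embeddings, the kernel construction L = B Bᵀ, and the function names `l_ensemble_kernel` and `dpp_subset_probability` are all assumptions introduced purely for illustration.

```python
import itertools

import numpy as np


def l_ensemble_kernel(features):
    """Build a positive semidefinite kernel L = B B^T from row feature vectors."""
    b = np.asarray(features, dtype=float)
    return b @ b.T


def dpp_subset_probability(kernel, subset):
    """Return P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP."""
    n = kernel.shape[0]
    normalizer = np.linalg.det(kernel + np.eye(n))
    idx = list(subset)
    if not idx:
        numerator = 1.0  # determinant of the empty matrix is 1 by convention
    else:
        numerator = np.linalg.det(kernel[np.ix_(idx, idx)])
    return numerator / normalizer


# Hypothetical embeddings (illustrative only): items 0 and 1 are
# near-duplicates, while item 2 points in a different direction.
personalities = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
L = l_ensemble_kernel(personalities)

p_similar = dpp_subset_probability(L, (0, 1))  # redundant pair
p_diverse = dpp_subset_probability(L, (0, 2))  # orthogonal pair

# Sanity check: the probabilities of all subsets sum to one.
total = sum(
    dpp_subset_probability(L, s)
    for r in range(len(personalities) + 1)
    for s in itertools.combinations(range(len(personalities)), r)
)
```

In this toy setting the orthogonal pair receives a much higher probability than the near-duplicate pair, which is the diversity-promoting property that makes a DPP a natural candidate for maintaining distinct personality types. How the kernel is actually learned and integrated into MoP is not specified in the abstract and is not modeled here.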