In cooperative multi-agent reinforcement learning (MARL), agents pursue a common goal, such as defeating enemies or scoring a goal. Existing MARL algorithms are effective, but they still require long training times and, on complex tasks, often become trapped in local optima and fail to discover a goal-reaching policy. To address this, we introduce Efficient episodic Memory Utilization (EMU) for MARL, with two primary objectives: (a) accelerating reinforcement learning by leveraging semantically coherent memory from an episodic buffer and (b) selectively promoting desirable transitions to prevent convergence to local optima. To achieve (a), EMU trains an encoder/decoder structure alongside MARL, producing coherent memory embeddings that facilitate exploratory memory recall. To achieve (b), EMU introduces a novel reward structure, called episodic incentive, based on the desirability of states. This reward improves the TD target in Q-learning and acts as an additional incentive for desirable transitions. We provide theoretical support for the proposed incentive and demonstrate the effectiveness of EMU compared to conventional episodic control. The proposed method is evaluated in StarCraft II and Google Research Football, and empirical results indicate further performance improvement over state-of-the-art methods.
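To illustrate how an additive incentive can enter the TD target, here is a minimal sketch. It is not the EMU formulation: the abstract does not give the exact form of the episodic incentive, so `toy_incentive`, its `scale` parameter, and the boolean desirability flag are hypothetical stand-ins chosen only for illustration. The sketch shows a standard one-step Q-learning target with an extra reward term added for transitions that lead to a desirable state.

```python
def td_target(reward, next_q_values, gamma=0.99, incentive=0.0, done=False):
    """One-step Q-learning target with an optional additive incentive.

    target = reward + incentive + gamma * max_a' Q(s', a')
    The bootstrap term is dropped when the episode has terminated.
    """
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + incentive + bootstrap


def toy_incentive(next_state_is_desirable, scale=0.1):
    """Hypothetical incentive (an assumption, not the paper's definition):
    a fixed bonus when the next state is flagged as desirable."""
    return scale if next_state_is_desirable else 0.0


# Usage: the same transition, with and without the extra incentive.
plain = td_target(1.0, [0.5, 2.0], gamma=0.5)
boosted = td_target(1.0, [0.5, 2.0], gamma=0.5,
                    incentive=toy_incentive(True, scale=0.25))
```

In this toy setting the incentive simply raises the target for transitions into desirable states, so the corresponding Q-values are pushed up faster than those of other transitions.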