In cooperative multi-agent reinforcement learning (MARL), agents aim to achieve a common goal, such as defeating enemies or scoring a goal. Existing MARL algorithms are effective, but they still require significant learning time and, on complex tasks, often become trapped in local optima and fail to discover a goal-reaching policy. To address this, we introduce Efficient episodic Memory Utilization (EMU) for MARL, with two primary objectives: (a) accelerating reinforcement learning by leveraging semantically coherent memory from an episodic buffer and (b) selectively promoting desirable transitions to prevent convergence to local optima. To achieve (a), EMU incorporates a trainable encoder/decoder structure alongside MARL, creating coherent memory embeddings that facilitate exploratory memory recall. To achieve (b), EMU introduces a novel reward structure, called the episodic incentive, based on the desirability of states. This reward improves the TD target in Q-learning and acts as an additional incentive for desirable transitions. We provide theoretical support for the proposed incentive and demonstrate the effectiveness of EMU compared to conventional episodic control. The proposed method is evaluated in StarCraft II and Google Research Football, and empirical results indicate further performance improvement over state-of-the-art methods.
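To make the role of an extra incentive in the TD target concrete, the following is a minimal sketch of a one-step Q-learning target with an additive bonus. It is an illustration under stated assumptions, not the method described above: the abstract does not specify how the episodic incentive is computed, so `incentive` is a hypothetical placeholder value, and the function name `td_target` and its parameters are likewise made up for this example.

```python
def td_target(reward, next_q_values, done, gamma=0.99, incentive=0.0):
    """One-step Q-learning target with an optional additive bonus.

    `incentive` is a hypothetical stand-in for a bonus on desirable
    transitions; how such a bonus would be derived is not given in the text.
    """
    # Standard bootstrap term: discounted max over next-state action values,
    # dropped when the episode has terminated.
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + incentive + bootstrap


# Same transition, with and without the assumed bonus.
plain = td_target(1.0, [0.5, 2.0], done=False, gamma=0.9)
boosted = td_target(1.0, [0.5, 2.0], done=False, gamma=0.9, incentive=0.3)
terminal = td_target(1.0, [5.0], done=True)
```

In this sketch the bonus simply shifts the target for the flagged transition upward, so the corresponding Q-value is pulled higher during training than it would be from the environment reward alone.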