Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this paper, we propose a method that encourages cooperative exploration, based on the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees). The high-level intuition is that, under optimism-based exploration, agents can reach cooperative strategies if each agent's optimism estimate captures a structured dependency on the other agents. At each node (i.e., action) of the search tree, UCT performs optimism-based exploration using a bonus that is conditioned on the visitation count of the node's parent. We view MARL as iterations of tree search and develop a method called Conditionally Optimistic Exploration (COE). We assume that agents take actions in a sequential order, and treat nodes at the same depth of the search tree as the actions of one individual agent. COE augments each agent's state-action value estimate with an optimistic bonus derived from the visitation count of the state and the joint actions taken by agents up to and including the current agent. COE is adaptable to any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.
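To make the conditional bonus concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes hashable states, a fixed agent ordering, and a UCT-style bonus of the form c * sqrt(ln N(parent) / N(child)), where the parent is the state together with the actions of the preceding agents and the child additionally includes the current agent's action. The abstract does not give the exact form of the bonus, so this formula, the class name `ConditionalCountBonus`, and the coefficient `c` are all illustrative assumptions.

```python
import math
from collections import defaultdict


class ConditionalCountBonus:
    """Tabular UCT-style bonus over prefixes of the joint action (illustrative only)."""

    def __init__(self, c=1.0):
        self.c = c  # exploration coefficient (hypothetical hyperparameter)
        self.counts = defaultdict(int)  # (state, action_prefix) -> visit count

    def update(self, state, joint_action):
        # Count every prefix of the ordered joint action: (), (a1,), (a1, a2), ...
        for i in range(len(joint_action) + 1):
            self.counts[(state, tuple(joint_action[:i]))] += 1

    def bonus(self, state, prev_actions, action):
        # Parent node: the state plus the actions of the preceding agents.
        prefix = tuple(prev_actions)
        parent = self.counts.get((state, prefix), 0)
        # Child node: the parent extended by the current agent's action.
        child = self.counts.get((state, prefix + (action,)), 0)
        if child == 0:
            return math.inf  # unvisited children are maximally optimistic
        # child >= 1 implies parent >= 1, so the logarithm is non-negative.
        return self.c * math.sqrt(math.log(parent) / child)


# Usage: two agents acting in order from the same state "s0".
tracker = ConditionalCountBonus(c=1.0)
tracker.update("s0", (0, 1))
tracker.update("s0", (0, 0))
second_agent_bonus = tracker.bonus("s0", (0,), 1)  # conditioned on agent 1 choosing 0
```

In a deep MARL setting the exact counts would have to be replaced by some approximate count or novelty estimate, and the bonus would be added to each agent's utility before value decomposition; both of those steps are outside this sketch.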