Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose a method that encourages cooperative exploration, based on a sequential action-computation scheme. The high-level intuition is that, under optimism-based exploration, agents explore cooperative strategies if each agent's optimism estimate captures a structured dependency on the other agents. Assuming agents compute actions in a sequential order at \textit{each environment timestep}, we view MARL as tree search iterations in which agents are nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent's state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and the joint actions of preceding agents. COE is applied during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.
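To make the conditional bonus concrete, the following is a minimal tabular sketch of the idea, not the method as implemented in this work. It assumes hashable global states, exact visitation counts stored in a dictionary, and a smoothed UCT-style bonus $c\sqrt{\ln(1 + N_{\text{parent}}) / (1 + N_{\text{child}})}$, where the parent count covers the state and the preceding agents' actions and the child count additionally covers the candidate action. The class name `ConditionalCountBonus`, its methods, and the constant `c` are hypothetical names chosen for illustration.

```python
import math


class ConditionalCountBonus:
    """Hypothetical tabular sketch of a count-based, action-conditioned bonus."""

    def __init__(self, c=1.0):
        self.c = c          # exploration constant (assumed hyperparameter)
        self.counts = {}    # (state, action prefix) -> visitation count

    def _n(self, state, prefix):
        return self.counts.get((state, tuple(prefix)), 0)

    def bonus(self, state, preceding, action):
        # Parent node: the state plus the actions of preceding agents.
        parent = self._n(state, preceding)
        # Child node: the parent extended with this agent's candidate action.
        child = self._n(state, tuple(preceding) + (action,))
        # Smoothed UCT-style term (an assumed form, finite for unvisited nodes).
        return self.c * math.sqrt(math.log(1.0 + parent) / (1.0 + child))

    def select(self, state, q_values, explore=True):
        # Agents choose one after another; each conditions on earlier choices.
        joint = ()
        for q_i in q_values:
            def score(a, q_i=q_i, prefix=joint):
                extra = self.bonus(state, prefix, a) if explore else 0.0
                return q_i[a] + extra
            joint += (max(range(len(q_i)), key=score),)
        return joint

    def update(self, state, joint):
        # Increment the count of every prefix of the executed joint action.
        joint = tuple(joint)
        for depth in range(len(joint) + 1):
            key = (state, joint[:depth])
            self.counts[key] = self.counts.get(key, 0) + 1


explorer = ConditionalCountBonus(c=1.0)
for _ in range(3):
    explorer.update("s0", (0, 0))

q_values = [[0.0, 0.0], [0.0, 0.0]]  # two agents, two actions each
train_actions = explorer.select("s0", q_values, explore=True)
deploy_actions = explorer.select("s0", q_values, explore=False)
```

Passing `explore=False` drops the bonus and leaves a purely greedy choice over each agent's own values, which mirrors the description of COE being disabled at deployment. In a deep MARL setting the exact counts would presumably be replaced by an approximation, which this sketch does not attempt.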