Effective exploration is crucial for discovering optimal strategies in multi-agent reinforcement learning (MARL) on complex coordination tasks. Existing methods mainly use intrinsic rewards to enable committed exploration, or role-based learning to decompose joint action spaces, rather than directly conducting a collective search over the entire action-observation space. However, they often struggle to find the specific joint action sequences needed to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a method for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state at which they can influence each other's transition functions. We then use a simulator to initialize the environment at this state before the exploration phase. We formulate imagination as a sequence modeling problem in which states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt, consisting of the timestep-to-go, return-to-go, influence value, and a one-shot demonstration, specifies the desired state and trajectory and guides action generation. By initializing agents at critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, IIE outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments in our empirical evaluation. In particular, IIE shows improved performance on sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods such as CVAE-GAN and diffusion models.
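To make the prompt construction concrete, the following is a minimal sketch of how the timestep-to-go and return-to-go features could be computed for a trajectory and interleaved with transition tokens in the order stated above (state, observation, prompt, action, reward). It rests on several assumptions not taken from the source: return-to-go is undiscounted (as in Decision Transformer), influence values are supplied externally because their computation is not specified here, the one-shot demonstration is simply prepended as a prefix, and all names (`Step`, `returns_to_go`, `timesteps_to_go`, `build_token_sequence`) are hypothetical. Token embedding and the transformer itself are omitted.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


# Hypothetical per-timestep record; field names are illustrative only.
@dataclass
class Step:
    state: Tuple[float, ...]
    obs: Tuple[float, ...]   # could hold per-agent observations
    action: int              # could hold a joint action
    reward: float


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted return-to-go: sum of rewards from step t to the end."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]


def timesteps_to_go(horizon: int, length: int) -> List[int]:
    """Remaining steps until an assumed target timestep `horizon`."""
    return [horizon - t for t in range(length)]


def build_token_sequence(
    traj: List[Step],
    influence: List[float],
    horizon: int,
    demo: Optional[List[Any]] = None,
) -> List[Tuple[str, Any]]:
    """Interleave prompt and transition tokens for autoregressive modeling."""
    rtg = returns_to_go([s.reward for s in traj])
    ttg = timesteps_to_go(horizon, len(traj))
    # Assumption: the one-shot demonstration is prepended as a prefix.
    seq: List[Tuple[str, Any]] = [("demo", d) for d in (demo or [])]
    for t, step in enumerate(traj):
        prompt = (ttg[t], rtg[t], influence[t])
        seq.append(("state", step.state))
        seq.append(("obs", step.obs))
        seq.append(("prompt", prompt))
        seq.append(("action", step.action))
        seq.append(("reward", step.reward))
    return seq
```

For a three-step trajectory whose only reward arrives at the last step, every prompt carries a return-to-go of 1.0 while the timestep-to-go counts down, which is the signal a model of this kind could condition on to steer generation toward the desired state.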