Cooperative multi-agent reinforcement learning (MARL) requires agents to explore in order to learn to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is inefficient at discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent because the other agents are trained simultaneously, leading to high-variance and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in an upper confidence bound (UCB) policy to guide the exploration of agents towards parts of the environment which require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance than commonly applied target networks, and we show that they lead to more stable gradients during optimisation. We instantiate three value-based MARL algorithms with EMAX (independent DQN, VDN and QMIX) and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 53%, 36%, and 498%, respectively, averaged across all 21 tasks.
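The two uses of the ensemble described above can be illustrated with a minimal single-agent sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `ucb_action` and `ensemble_target`, the exploration weight `beta`, the use of the standard deviation as the uncertainty measure, and the one-step bootstrapped form of the target are all hypothetical choices made for this example.

```python
import numpy as np


def ucb_action(q_ensemble, beta=1.0):
    """Pick an action with a UCB rule over an ensemble of value estimates.

    q_ensemble: array of shape (K, n_actions), one row per ensemble member.
    beta: hypothetical weight on the uncertainty bonus (an assumption).
    """
    mean = q_ensemble.mean(axis=0)  # average value estimate per action
    std = q_ensemble.std(axis=0)    # ensemble disagreement per action
    return int(np.argmax(mean + beta * std))


def ensemble_target(reward, next_q_ensemble, gamma=0.99, done=False):
    """One-step target that bootstraps from the ensemble-average estimate.

    next_q_ensemble: array of shape (K, n_actions) for the next observation.
    The one-step form used here is an assumption made for illustration.
    """
    mean_next = next_q_ensemble.mean(axis=0)
    bootstrap = 0.0 if done else gamma * float(mean_next.max())
    return reward + bootstrap


# Three ensemble members, two actions. All members agree on action 0,
# while they disagree strongly on action 1.
q = np.array([[1.0, 0.0],
              [1.0, 2.0],
              [1.0, -2.0]])
greedy = ucb_action(q, beta=0.0)       # ignores uncertainty
exploratory = ucb_action(q, beta=1.0)  # rewards disagreement

next_q = np.array([[1.0, 3.0],
                   [3.0, 1.0]])
target = ensemble_target(1.0, next_q, gamma=0.5)
```

With `beta=0.0` the rule reduces to greedy selection on the ensemble mean and picks action 0. With `beta=1.0` the disagreement on action 1 outweighs its lower mean, so the agent explores action 1 instead. In the target, averaging before bootstrapping smooths out the disagreement between individual members.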