\textit{Relative overgeneralization} (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy because they overfit to the suboptimal behavior of other agents. Early work showed that optimism can mitigate the RO problem in tabular Q-learning. However, with function approximation, optimism can amplify overestimation and thus fail on complex tasks. On the other hand, recent deep multi-agent policy gradient (MAPG) methods have succeeded in many complex tasks but may fail under severe RO. We propose a general, yet simple, framework to enable optimistic updates in MAPG methods and alleviate the RO problem. Specifically, we reshape the advantages used in the policy update with a \textit{Leaky ReLU} function whose single hyperparameter selects the degree of optimism. Intuitively, our method remains optimistic toward individual actions with lower returns, which are potentially caused by other agents' suboptimal behavior during learning. This optimism prevents the individual agents from quickly converging to a local optimum. We also provide a formal analysis from an operator view to understand the proposed advantage transformation. In extensive evaluations on a diverse set of tasks, including illustrative matrix games and the complex \textit{Multi-agent MuJoCo} and \textit{Overcooked} benchmarks, the proposed method\footnote{Code can be found at \url{https://github.com/wenshuaizhao/optimappo}.} outperforms strong baselines on 13 out of 19 tested tasks and matches their performance on the rest.
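
As a minimal sketch of the advantage reshaping described above, assume the hyperparameter is a negative-side slope $\eta \in [0, 1]$ applied to an advantage estimate $A$. The symbol $\eta$, its range, and the notation $\tilde{A}$ are illustrative assumptions, not taken from the text. The standard Leaky ReLU form would then read
\[
\tilde{A} = \left\{ \begin{array}{ll} A, & A \ge 0, \\ \eta A, & A < 0. \end{array} \right.
\]
Under this reading, $\eta = 1$ would leave the advantage unchanged and recover the usual policy gradient update, while $\eta = 0$ would discard negative advantages entirely, which is the most optimistic setting. Intermediate values would only dampen the penalty on actions whose low returns may stem from other agents' behavior.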