Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction

Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms. Despite its popularity, it suffers from a critical drawback: it learns from a single sample of the joint action at a given state. As agents explore and update their policies during training, these single samples may poorly represent the actual joint policy of the system of agents, leading to high-variance gradient estimates that hinder learning. To address this problem, we propose an enhancement tool that is compatible with any actor-critic MARL method. Our framework, Performance Enhancing Reinforcement Learning Apparatus (PERLA), introduces sampling of the agents' joint policy into the critics while the agents train. This leads to TD updates that closely approximate the true expected value under the current joint policy, rather than estimates from a single sample of the joint action at a given state. The result is low-variance, precise estimates of expected returns, which minimises the critic-estimator variance that typically hinders learning. Moreover, as we demonstrate, by eliminating much of the critic variance that arises from single sampling of the joint policy, PERLA enables CT-DE methods to scale more efficiently with the number of agents. Theoretically, we prove that PERLA reduces the variance of value estimates to a level similar to that of decentralised training while maintaining the benefits of centralised training. Empirically, we demonstrate PERLA's superior performance and its ability to reduce estimator variance on a range of benchmarks, including Multi-agent Mujoco and the StarCraft II Multi-agent Challenge.

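To make the variance argument concrete, the following is a minimal sketch of the underlying idea, not the authors' implementation. It assumes a hypothetical two-agent, single-state setting in which a fixed table `Q_TABLE` stands in for a learned centralised critic and `OTHER_POLICY` stands in for the other agent's current policy; all function names are illustrative. It compares a critic target built from a single sampled action of the other agent with a target averaged over several samples from that agent's policy.

```python
import random
import statistics

# Hypothetical stand-in for a centralised critic Q(s, a_own, a_other) at one state.
# Rows index the agent's own action, columns the other agent's action.
Q_TABLE = [[10.0, 0.0],
           [0.0, 5.0]]

# Hypothetical current policy of the other agent (categorical over two actions).
OTHER_POLICY = [0.5, 0.5]


def sample_action(probs, rng):
    """Draw an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def single_sample_target(q_table, own_action, other_policy, rng):
    """Critic target that uses one sampled action of the other agent."""
    return q_table[own_action][sample_action(other_policy, rng)]


def averaged_target(q_table, own_action, other_policy, rng, num_samples):
    """Critic target averaged over several sampled actions of the other agent."""
    total = 0.0
    for _ in range(num_samples):
        total += q_table[own_action][sample_action(other_policy, rng)]
    return total / num_samples


def exact_expectation(q_table, own_action, other_policy):
    """Exact expected critic value under the other agent's policy."""
    return sum(p * q for p, q in zip(other_policy, q_table[own_action]))


def estimator_variance(estimator, trials=2000, seed=0):
    """Empirical variance of an estimator over repeated independent draws."""
    rng = random.Random(seed)
    return statistics.pvariance([estimator(rng) for _ in range(trials)])


var_single = estimator_variance(
    lambda rng: single_sample_target(Q_TABLE, 0, OTHER_POLICY, rng))
var_averaged = estimator_variance(
    lambda rng: averaged_target(Q_TABLE, 0, OTHER_POLICY, rng, num_samples=16))

print("exact expectation:", exact_expectation(Q_TABLE, 0, OTHER_POLICY))
print("variance, single sample:", var_single)
print("variance, 16-sample average:", var_averaged)
```

Both estimators are unbiased for the same expectation, but averaging over independent samples shrinks the variance roughly in proportion to the number of samples. This is the general Monte Carlo effect that the abstract attributes to sampling the joint policy inside the critic; how PERLA integrates this into TD updates is not specified here.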