Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, the variance of each agent's gradient estimate scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\epsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\epsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
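To make the decoupling idea concrete, the sketch below illustrates one way a guidance gradient could be obtained from a differentiable analytical model. It is a minimal illustration under stated assumptions, not the paper's implementation: the cost model `system_cost` (a hypothetical load-balance penalty) and all names are invented for exposition.

```python
# Minimal sketch of a per-agent "guidance gradient" in the spirit of DG-PG.
# Illustrative only: `system_cost` is a hypothetical differentiable analytical
# model, not the construction used in the paper.
import jax
import jax.numpy as jnp

N = 200  # number of agents

def system_cost(actions):
    # Hypothetical analytical model: penalize deviation of the induced
    # per-agent load from the uniform load that the model prescribes as the
    # efficient system state.
    load = jax.nn.softmax(actions)
    return jnp.sum((load - 1.0 / N) ** 2)

# The guidance gradient is the exact derivative of the analytical model, so it
# is deterministic given the current actions: unlike a score-function policy
# gradient of a shared reward, it involves no Monte-Carlo sampling over the
# other agents' behavior, so its variance does not grow with N.
guidance = jax.grad(system_cost)(jnp.zeros(N))  # shape (N,); entry i guides agent i
```

In such a scheme, agent $i$ would update against its own entry `guidance[i]` as a noise-free learning signal in place of (or blended with) the sampled shared-reward policy gradient, which is the source of the $\Theta(N)$ variance the abstract describes.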