This paper proposes a novel Multi-agent Reinforcement Learning (MARL) approach, Multi-Agent Continuous Dynamic Policy Gradient (MACDPP), to tackle limited learning capability and poor sample efficiency in scenarios controlled by multiple agents. It alleviates the inconsistency of multiple agents' policy updates by introducing relative entropy regularization into the Centralized Training with Decentralized Execution (CTDE) framework with an Actor-Critic (AC) structure. Evaluated on multi-agent cooperation and competition tasks and on traditional control tasks, including OpenAI benchmarks and robot arm manipulation, MACDPP demonstrates significant superiority in learning capability and sample efficiency over both related multi-agent baselines and widely implemented single-agent baselines, and therefore expands the potential of MARL for effectively learning challenging control scenarios.
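To make the role of relative entropy regularization concrete, the following is a minimal sketch of a generic KL-regularized policy improvement step over a finite set of candidate actions. It is an illustration under stated assumptions, not the MACDPP update itself: the function name `kl_regularized_update`, the temperature `eta`, and the discrete candidate set are all hypothetical. The step maximizes expected value minus a KL penalty toward the previous policy, which has the closed form new policy ∝ old policy × exp(eta × Q), so each update stays close to the previous policy.

```python
import numpy as np


def kl_regularized_update(prior_probs, q_values, eta):
    """Generic relative-entropy-regularized policy improvement (illustrative).

    Solves  max_p  sum_a p(a) * Q(a) - (1 / eta) * KL(p || prior)
    over probability vectors p, whose closed-form solution is
    p(a) proportional to prior(a) * exp(eta * Q(a)).

    prior_probs : previous policy over candidate actions (strictly positive)
    q_values    : critic estimates for the same candidate actions
    eta         : assumed temperature; eta = 0 keeps the prior unchanged,
                  larger eta moves further toward the greedy action
    """
    prior_probs = np.asarray(prior_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    # Work in log space and subtract the maximum for numerical stability.
    logits = np.log(prior_probs) + eta * q_values
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]   # hypothetical previous policy
    q = [1.0, 2.0, 0.5, 1.5]           # hypothetical critic values
    for eta in (0.0, 1.0, 10.0):
        print(eta, kl_regularized_update(prior, q, eta))
```

A small `eta` keeps successive policies nearly identical, which is the intuition behind using such a penalty to limit inconsistent updates across agents; how the paper applies the penalty within CTDE is not specified in this abstract.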