Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. It is a well-suited benchmark for multi-agent training, providing a battleground for two teams in which allied agents can communicate. Pommerman poses significant challenges for model-free reinforcement learning because of delayed action effects, sparse rewards, and false positives, where an agent wins only because opponents lose through their own mistakes. This study introduces a system that trains multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems that arise when deploying a multi-agent training system for competitive games: sparse rewards and the choice of a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor, based on agents' performance, that dynamically adjusts the dense exploration reward during training. Additionally, we implement a matchmaking mechanism that uses the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
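The two mechanisms named above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption for illustration only: the Elo functions use the conventional logistic form with a scale of 400 and K = 32, `pick_opponent` pairs an agent with the closest-rated member of a pool, and `annealing_factor` is a hypothetical linear schedule that shrinks the dense exploration reward as the win rate rises. None of these names or constants come from the study itself.

```python
def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is a common default, assumed here.
    """
    e_a = elo_expected(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta


def pick_opponent(agent_rating, pool):
    """Hypothetical matchmaking rule: choose the closest-rated opponent.

    pool maps an opponent name to its current Elo rating.
    """
    return min(pool, key=lambda name: abs(pool[name] - agent_rating))


def annealing_factor(win_rate, threshold=0.5):
    """Hypothetical schedule: 1.0 at win_rate 0, decaying linearly to 0.0
    once win_rate reaches the threshold."""
    return max(0.0, 1.0 - win_rate / threshold)


def shaped_reward(sparse, dense, win_rate):
    """Sparse game outcome plus the annealed dense exploration reward."""
    return sparse + annealing_factor(win_rate) * dense


if __name__ == "__main__":
    pool = {"agent_a": 1400.0, "agent_b": 1520.0, "agent_c": 1700.0}
    opponent = pick_opponent(1500.0, pool)
    new_me, new_opp = elo_update(1500.0, pool[opponent], score_a=1.0)
    print(opponent, round(new_me, 1), round(new_opp, 1))
    print(shaped_reward(sparse=1.0, dense=0.4, win_rate=0.25))
```

Tying the factor to win rate rather than to the training step means the dense reward fades only once the agent actually starts winning, which is one plausible reading of "based on agents' performance"; other performance signals, such as the Elo rating itself, would fit the same structure.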