In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically tied to the number of agents in that game. Each MG induced by varying the population size may have its own optimal joint strategies and game-specific knowledge, which modern multi-agent reinforcement learning algorithms model independently. In this work, we focus on creating agents that generalize across population-varying MGs. Instead of learning a unimodal policy, each agent learns a policy set comprising strategies that are effective across a variety of games. To achieve this, we propose Meta Representations for Agents (MRA), an approach that explicitly models the game-common and game-specific strategic knowledge. By representing the policy sets with multi-modal latent policies, MRA discovers the game-common strategic knowledge and diverse strategic modes through an iterative optimization procedure. We prove that, by approximately maximizing the resulting constrained mutual information objective, the policies can reach a Nash Equilibrium in every evaluation MG when the latent space is sufficiently large. When MRA is deployed in practical settings with limited latent space sizes, fast adaptation can be achieved by leveraging first-order gradient information. Extensive experiments demonstrate the effectiveness of MRA in improving training performance and generalization ability in challenging evaluation games.
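To make the idea of a multi-modal latent policy and a mutual information objective more concrete, the following is a minimal illustrative sketch, not the MRA objective itself, whose exact form and constraints the abstract does not state. It assumes a discrete latent variable `z` with `K` modes, a single fixed observation, and a tabular policy `pi(a | z)`, and it computes the mutual information I(Z; A) between the latent mode and the chosen action. The function name `latent_action_mutual_information` and its parameters are hypothetical.

```python
import numpy as np


def latent_action_mutual_information(mode_policies, latent_prior=None):
    """Return I(Z; A) in nats for a discrete latent Z and discrete action A.

    Illustrative assumption: a tabular policy for one fixed observation.
    mode_policies: array of shape (K, num_actions); row z is pi(a | z).
    latent_prior:  array of shape (K,); defaults to a uniform p(z).
    """
    pi = np.asarray(mode_policies, dtype=float)
    num_modes = pi.shape[0]
    if latent_prior is None:
        p_z = np.full(num_modes, 1.0 / num_modes)
    else:
        p_z = np.asarray(latent_prior, dtype=float)

    # Marginal action distribution p(a) = sum_z p(z) * pi(a | z).
    p_a = p_z @ pi

    # I(Z; A) = sum_z p(z) sum_a pi(a | z) * log(pi(a | z) / p(a)),
    # using the convention 0 * log 0 = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(pi > 0, np.log(pi / p_a), 0.0)
    return float(np.sum(p_z[:, None] * pi * log_ratio))


# Two identical modes carry no information about z.
collapsed = latent_action_mutual_information([[0.5, 0.5], [0.5, 0.5]])

# Three distinct deterministic modes are fully identifiable from the action.
diverse = latent_action_mutual_information(np.eye(3))
```

Under a uniform prior, identical rows give a mutual information of 0, while `K` distinct deterministic rows give `log K`. In this toy setting, maximizing the quantity pushes the latent modes toward distinguishable behaviors, which is the intuition behind discovering diverse strategic modes; any constraint tying those modes to game performance would have to be added separately.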