Many multiagent reinforcement learning (MARL) methods rely on centralized training. However, these methods require obtaining various types of information from the other agents, which may not be feasible in competitive or adversarial settings. A recent method, the interactive advantage actor critic (IA2C), combines decentralized training with decentralized execution and aims to predict the other agents' actions from possibly noisy observations. In this paper, we present the latent IA2C, which uses an encoder-decoder architecture to learn a latent representation of the hidden state and the other agents' actions. Our experiments in two domains, each populated by many agents, reveal that the latent IA2C significantly improves sample efficiency by reducing variance and converging faster. Additionally, we introduce open versions of these domains, in which the agent population may change over time, and evaluate on these instances as well.