In real-world environments, autonomous agents rely on their own egocentric observations. They must learn adaptive strategies for interacting with others whose mixed motivations can be discerned only through visible cues. Several Multi-Agent Reinforcement Learning (MARL) methods adopt centralized approaches, involving either centralized training or reward-sharing, that often depart from the way living organisms such as animals or humans process information and interact. MARL strategies that combine decentralized training with intrinsic motivation offer a self-supervised alternative, enabling autonomous agents to develop flexible social strategies through their interactions with one another. However, by contrasting self-supervised and centralized methods, we find that populations trained with reward-sharing outperform those trained with self-supervised methods in a mixed-motive environment. We attribute this advantage to the emergence of specialized roles and to each agent's expertise in its role. Interestingly, the gap shrinks in pure-motive settings, which underscores the need to evaluate methods in more complex, realistic (mixed-motive) environments. Our preliminary results point to a gap in population performance that improved self-supervised methods could close, bringing MARL closer to real-world readiness.