Recent reinforcement learning (RL) methods have achieved success in various domains. However, multi-agent RL (MARL) remains challenging with respect to decentralization, partial observability, and scalability to many agents. Meanwhile, collective behavior requires resolving these challenges and remains important to many state-of-the-art applications such as active matter physics, self-organizing systems, opinion dynamics, and biological or robotic swarms. Here, MARL via mean field control (MFC) offers a potential solution to scalability, but does not account for decentralized and partially observable systems. In this paper, we enable decentralized behavior of agents under partial information by proposing novel models for decentralized partially observable MFC (Dec-POMFC), a broad class of problems with permutation-invariant agents that can be reduced to tractable single-agent Markov decision processes (MDPs) and solved with single-agent RL. We provide rigorous theoretical results, including a dynamic programming principle, together with optimality guarantees for Dec-POMFC solutions applied to finite swarms of interest. Algorithmically, we propose Dec-POMFC-based policy gradient methods for MARL via centralized training and decentralized execution, together with policy gradient approximation guarantees. In addition, we improve upon state-of-the-art histogram-based MFC by using kernel methods, which is of independent interest for fully observable MFC as well. We evaluate our approach numerically on representative collective behavior tasks such as adapted Kuramoto and Vicsek swarming models, where it performs on par with state-of-the-art MARL. Overall, our framework takes a step towards RL-based engineering of artificial collective behavior via MFC.
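To make the contrast between histogram-based and kernel-based mean field representations more concrete, the following is a minimal sketch and not the paper's implementation. It assumes scalar agent states in [0, 1), and the function names, the Gaussian kernel, the bin centers, and the bandwidth are all hypothetical choices made only for illustration. Both functions map a swarm's states to a fixed-size, permutation-invariant vector that a single-agent RL policy could take as input.

```python
import numpy as np


def histogram_mean_field(states, num_bins, low=0.0, high=1.0):
    """Empirical state distribution as normalized bin counts (hard assignment)."""
    counts, _ = np.histogram(states, bins=num_bins, range=(low, high))
    return counts / len(states)


def kernel_mean_field(states, centers, bandwidth):
    """Smoothed alternative: mean Gaussian kernel response at each center.

    The Gaussian kernel and the final normalization are illustrative
    assumptions, not taken from the paper.
    """
    diffs = states[:, None] - centers[None, :]          # shape (agents, centers)
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)   # soft assignment
    features = weights.mean(axis=0)                     # average over agents
    return features / features.sum()


# Hypothetical swarm of 200 agents with uniformly random states.
rng = np.random.default_rng(0)
states = rng.uniform(0.0, 1.0, size=200)
centers = (np.arange(8) + 0.5) / 8                      # midpoints of 8 equal bins

hist = histogram_mean_field(states, num_bins=8)
kern = kernel_mean_field(states, centers, bandwidth=0.1)
```

Because both representations average over agents, reordering the agents leaves them unchanged. The kernel version varies smoothly as an agent's state crosses a bin boundary, whereas the histogram changes in jumps.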