Recent reinforcement learning (RL) methods have achieved success in various domains. However, multi-agent RL (MARL) remains challenging in terms of decentralization, partial observability, and scalability to many agents. Meanwhile, collective behavior requires resolving these challenges and remains important to many state-of-the-art applications such as active matter physics, self-organizing systems, opinion dynamics, and biological or robotic swarms. Here, MARL via mean field control (MFC) offers a potential solution to scalability, but fails to consider decentralized and partially observable systems. In this paper, we enable decentralized behavior of agents under partial information by proposing novel models for decentralized partially observable MFC (Dec-POMFC), a broad class of problems with permutation-invariant agents that allows reduction to a tractable single-agent Markov decision process (MDP) solvable by single-agent RL. We provide rigorous theoretical results, including a dynamic programming principle, together with optimality guarantees for Dec-POMFC solutions applied to finite swarms of interest. Algorithmically, we propose Dec-POMFC-based policy gradient methods for MARL via centralized training and decentralized execution, together with policy gradient approximation guarantees. In addition, we improve upon state-of-the-art histogram-based MFC by kernel methods, which is also of separate interest for fully observable MFC. We evaluate numerically on representative collective behavior tasks such as adapted Kuramoto and Vicsek swarming models, where our methods are on par with state-of-the-art MARL. Overall, our framework takes a step towards RL-based engineering of artificial collective behavior via MFC.
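To illustrate the kernel-based improvement over histogram-based MFC mentioned above, the following is a minimal sketch of the underlying idea: representing the empirical mean field of N agent states either as a piecewise-constant histogram or as a kernel-smoothed density. All function names, the Gaussian kernel choice, and the bandwidth `h` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def histogram_mean_field(states, bins, low=0.0, high=1.0):
    """Histogram-based representation: piecewise-constant density
    over equal-width bins (the baseline approach)."""
    hist, _ = np.histogram(states, bins=bins, range=(low, high), density=True)
    return hist

def kernel_mean_field(states, grid, h=0.05):
    """Kernel-based representation: Gaussian-kernel smoothed density
    evaluated at the given grid points. Bandwidth h is an assumed
    hyperparameter; the smoothed density varies continuously with
    the agent states, unlike the histogram."""
    diffs = (grid[:, None] - states[None, :]) / h
    kernels = np.exp(-0.5 * diffs**2) / (h * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Example: 500 agent states on [0, 1], density evaluated on a 50-point grid.
rng = np.random.default_rng(0)
states = rng.uniform(0.0, 1.0, size=500)
grid = np.linspace(0.0, 1.0, 50)
density = kernel_mean_field(states, grid)
```

The practical difference is smoothness: shifting a single agent state slightly can move it across a histogram bin boundary and change the representation discontinuously, whereas the kernel estimate changes continuously, which is generally friendlier to gradient-based policy optimization.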