Recently, mean field control (MFC) has provided a tractable and theoretically grounded approach to otherwise difficult cooperative multi-agent control. However, its assumption of many independent, homogeneous agents may be too stringent in practice. In this work, we propose major-minor mean field control (M3FC), a novel discrete-time generalization of Markov decision processes and MFC to settings with both many minor agents and potentially complex major agents. In contrast to deterministic MFC, M3FC allows for stochastic minor agent distributions with strong correlation between minor agents through the major agent state, which can model arbitrary problem details not bound to any agent. Theoretically, we establish rigorous approximation properties, with novel proofs, for both M3FC and existing MFC models in the finite multi-agent problem, together with a dynamic programming principle for solving such problems. In the infinite-horizon discounted case, the existence of an optimal stationary policy follows. Algorithmically, we propose major-minor mean field proximal policy optimization (M3FPPO) as a novel multi-agent reinforcement learning algorithm and demonstrate its success in illustrative M3FC-type problems.
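As a rough illustration of why the minor agent distribution stays stochastic in the major-minor setting, the following toy sketch may help. It is not the model or algorithm of the paper: the binary states, the probabilities 0.8 and 0.2, and the function names `step_mean_field` and `simulate` are all assumptions made purely for illustration. Minor agents are sampled independently given the major agent state, so the empirical distribution concentrates as the number of agents grows, but the value it concentrates around depends on the random major state.

```python
import random


def step_mean_field(num_minor, major_state, rng):
    """Return the fraction of minor agents in state 1 after one toy step.

    Hypothetical rule (an assumption, not taken from the paper): given the
    major state, each minor agent lands in state 1 independently with
    probability 0.8 if major_state == 1, else 0.2.
    """
    p_one = 0.8 if major_state == 1 else 0.2
    ones = sum(rng.random() < p_one for _ in range(num_minor))
    return ones / num_minor


def simulate(num_minor, seed):
    """Sample a random major state, then the resulting empirical mean field."""
    rng = random.Random(seed)
    major_state = rng.randint(0, 1)  # the major agent state is itself random
    mean_field = step_mean_field(num_minor, major_state, rng)
    return major_state, mean_field
```

Calling `simulate(10_000, seed)` for several seeds should give mean fields close to either 0.8 or 0.2, depending on the sampled major state. Under these toy assumptions, a single deterministic mean field trajectory, as in standard MFC, would not capture that dependence.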