Learning in multi-agent systems is highly challenging due to several factors, including the non-stationarity introduced by agents' interactions and the combinatorial nature of their state and action spaces. In particular, we consider the Mean-Field Control (MFC) problem, which assumes an asymptotically infinite population of identical agents that aim to collaboratively maximize the collective reward. In many cases, solutions of an MFC problem are good approximations for large systems; hence, efficient learning for MFC is valuable for the analogous discrete-agent setting with many agents. Specifically, we focus on the case of unknown system dynamics, where the goal is to simultaneously optimize for the rewards and learn from experience. We propose an efficient model-based reinforcement learning algorithm, $M^3$-UCRL, that runs in episodes, balances exploration and exploitation during policy learning, and provably solves this problem. Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC, obtained via a novel mean-field type analysis. To learn the system's dynamics, $M^3$-UCRL can be instantiated with various statistical models, e.g., neural networks or Gaussian Processes. Moreover, we provide a practical parametrization of the core optimization problem that facilitates gradient-based optimization techniques when combined with differentiable dynamics approximation methods such as neural networks.
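The claim that an MFC solution approximates a large finite system can be illustrated with a toy simulation. The sketch below is not $M^3$-UCRL and involves no learning: it assumes hypothetical scalar linear dynamics in which each agent's next state depends on its own state, the population mean, a fixed control `u`, and independent Gaussian noise. All names and parameter values (`a`, `b`, `u`, `sigma`, `max_gap`) are illustrative assumptions, not taken from the paper. Under these assumptions, the empirical mean of a finite population should track the deterministic mean-field recursion more closely as the number of agents grows.

```python
import random


def simulate_population_mean(n_agents, horizon, a=0.6, b=0.3, u=0.2,
                             sigma=1.0, seed=0):
    """Empirical mean trajectory of n_agents coupled through their own mean.

    Assumed toy dynamics: x' = a * x + b * mean(x) + u + noise.
    """
    rng = random.Random(seed)
    states = [1.0] * n_agents  # every agent starts at x = 1
    means = [1.0]
    for _ in range(horizon):
        m = sum(states) / n_agents  # empirical mean-field statistic
        states = [a * x + b * m + u + rng.gauss(0.0, sigma) for x in states]
        means.append(sum(states) / n_agents)
    return means


def mean_field_limit(horizon, a=0.6, b=0.3, u=0.2):
    """Deterministic recursion for the mean as n_agents grows without bound."""
    m = 1.0
    means = [m]
    for _ in range(horizon):
        m = (a + b) * m + u  # idiosyncratic noise averages out in the limit
        means.append(m)
    return means


def max_gap(n_agents, horizon=20):
    """Largest deviation between the finite-population mean and the limit."""
    finite = simulate_population_mean(n_agents, horizon)
    limit = mean_field_limit(horizon)
    return max(abs(f - l) for f, l in zip(finite, limit))


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, max_gap(n))
```

Running the script prints the maximum gap for three population sizes. In this linear toy model the gap is driven purely by averaged noise, so it is expected to shrink roughly like $1/\sqrt{N}$; richer dynamics would need a distribution-valued mean-field state rather than a single mean.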