Many applications, such as shared mobility, require coordinating a large number of agents. Mean-field reinforcement learning addresses the resulting scalability challenge by optimizing the policy of a representative agent that interacts with an infinite population of identical agents, instead of modeling individual pairwise interactions. In this paper, we address an important generalization in which global constraints are imposed on the distribution of agents (e.g., capacity constraints or minimum coverage requirements). We propose Safe-M$^3$-UCRL, the first model-based mean-field reinforcement learning algorithm that attains safe policies even when the transitions are unknown. As a key ingredient, it uses epistemic uncertainty in the transition model within a log-barrier approach to ensure pessimistic constraint satisfaction with high probability. Beyond a synthetic swarm motion benchmark, we showcase Safe-M$^3$-UCRL on the vehicle repositioning problem faced by many shared mobility operators and evaluate its performance through simulations built on vehicle trajectory data from a service provider in Shenzhen. Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
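To make the log-barrier idea concrete, the following is a minimal illustrative sketch and not the paper's actual objective. It assumes a scalar constraint value that must stay positive (for example, coverage above a minimum level), a scalar uncertainty bound that shrinks the constraint margin pessimistically, and a hypothetical `barrier_weight` parameter; the function name and all of these quantities are assumptions made for illustration.

```python
import math


def pessimistic_log_barrier(reward, constraint_value, uncertainty, barrier_weight=0.1):
    """Illustrative objective: reward plus a log-barrier on a pessimistic margin.

    constraint_value: nominal constraint value under the learned model; the
        constraint counts as satisfied when it is positive (assumption).
    uncertainty: non-negative bound on how much model error could reduce the
        constraint value (assumed to come from epistemic uncertainty).
    barrier_weight: hypothetical trade-off parameter for the barrier term.
    """
    # Pessimism: assume the worst case allowed by the uncertainty bound.
    margin = constraint_value - uncertainty
    if margin <= 0.0:
        # The pessimistic constraint is violated, so the candidate is rejected.
        return -math.inf
    # The barrier term tends to -infinity as the margin approaches zero.
    return reward + barrier_weight * math.log(margin)


# A candidate whose nominal constraint is met but whose pessimistic margin is
# negative is rejected; lower uncertainty yields a larger objective value.
rejected = pessimistic_log_barrier(1.0, 0.5, 0.6)
confident = pessimistic_log_barrier(1.0, 0.5, 0.1)
uncertain = pessimistic_log_barrier(1.0, 0.5, 0.3)
```

In this sketch, greater model uncertainty pushes the optimizer away from the constraint boundary, which is the intuition behind satisfying constraints pessimistically while transitions are still being learned.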