Many applications, e.g., in shared mobility, require coordinating a large number of agents. Mean-field reinforcement learning addresses the resulting scalability challenge by optimizing the policy of a representative agent. In this paper, we address an important generalization in which the distribution of agents is subject to global constraints (e.g., capacity limits or minimum coverage requirements). We propose Safe-$\text{M}^3$-UCRL, the first model-based algorithm that attains safe policies even when the transition dynamics are unknown. As a key ingredient, it uses epistemic uncertainty in the transition model within a log-barrier approach to ensure pessimistic constraint satisfaction with high probability. We showcase Safe-$\text{M}^3$-UCRL on the vehicle repositioning problem faced by many shared mobility operators and evaluate its performance through simulations built on Shenzhen taxi trajectory data. Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
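To give some intuition for how a log-barrier can be combined with epistemic uncertainty, the following is a minimal sketch and not the formulation used by Safe-$\text{M}^3$-UCRL. It assumes a scalar constraint value that is feasible when positive, and a nonnegative uncertainty bound that is subtracted from it to obtain a pessimistic margin. The function name, its arguments, and the barrier weight `lam` are all hypothetical and chosen only for illustration.

```python
import math


def pessimistic_log_barrier(reward, constraint_value, uncertainty, lam=0.1):
    """Illustrative log-barrier objective (hypothetical, not the paper's exact form).

    reward:           estimated return of a candidate policy.
    constraint_value: estimated constraint value; positive means feasible
                      under the learned model's mean prediction.
    uncertainty:      nonnegative bound on how much the constraint value
                      could be overestimated because of model error.
    lam:              barrier weight (assumed hyperparameter).
    """
    # Pessimistic margin: shrink the estimated slack by the uncertainty bound.
    margin = constraint_value - uncertainty
    if margin <= 0:
        # The policy cannot be certified safe, so it is rejected outright.
        return -math.inf
    # The log term tends to -inf as the margin approaches zero, which keeps
    # the optimizer away from the boundary of the pessimistic safe set.
    return reward + lam * math.log(margin)
```

Under these assumptions, a larger uncertainty bound lowers the objective and can rule a policy out entirely, so a policy is only preferred when the constraint holds with some slack even in the worst case allowed by the bound.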