Multi-agent reinforcement learning (MARL) provides a promising paradigm for coordinating multi-agent systems (MAS). However, most existing methods rely on restrictive assumptions, such as a fixed number of agents and fully synchronous action execution. These assumptions are often violated in urban systems, where the number of active agents varies over time and actions may have heterogeneous durations, giving rise to a semi-MARL setting. Moreover, while sharing policy parameters among agents is commonly adopted to improve learning efficiency, it can lead to highly homogeneous actions when a subset of agents makes decisions concurrently under similar observations, potentially degrading coordination quality. To address these challenges, we propose Adaptive Value Decomposition (AVD), a cooperative MARL framework that adapts to a dynamically changing agent population. AVD further incorporates a lightweight mechanism to mitigate the action homogenization induced by shared policies, thereby encouraging behavioral diversity and maintaining effective cooperation among agents. In addition, we design a training-execution strategy tailored to the semi-MARL setting that accommodates asynchronous decision-making when agents act at different times. Experiments on real-world bike-sharing redistribution tasks in two major cities, London and Washington, D.C., demonstrate that AVD outperforms state-of-the-art baselines, confirming its effectiveness and generalizability.