Multi-Agent Congestion Cost Minimization With Linear Function Approximations

This work considers multiple agents traversing a network from a source node to a goal node. The cost an agent incurs for traveling a link has a private component and a congestion component. Each agent's objective is to find, in a decentralized way, a path to the goal node with minimum overall cost. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of the transition probabilities and the global cost function. In the absence of a central controller, and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its own estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves sub-linear regret. The proof requires the convergence of the cost function parameters, the convergence of the MAEVI algorithm, and an analysis of the regret bounds induced by the MAEVI triggering condition for each agent. To validate the algorithm, we implement it on a two-node network with multiple links. We first identify the optimal policy, that is, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work generalizes the problem of learning the stochastic shortest path.
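
The exchange of cost function parameters between neighbors over a time-varying communication network can be illustrated with a plain consensus-averaging sketch. This is not the MACCM update itself: the abstract does not state the update rule, so the doubly stochastic weight matrices, the alternating link schedule, and the names `consensus_step` and `run_consensus` are all assumptions made for illustration, and any local learning step the actual algorithm may perform is omitted.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def consensus_step(params: List[Vector], weights: Matrix) -> List[Vector]:
    """Agent i replaces its vector with a weighted average over all agents.

    weights[i][j] is the (assumed) weight agent i puts on agent j's vector;
    it is zero when i and j are not linked in the current round.
    """
    n = len(params)
    dim = len(params[0])
    return [
        [sum(weights[i][j] * params[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]


def run_consensus(params: List[Vector], schedule: List[Matrix], rounds: int) -> List[Vector]:
    """Apply consensus steps, cycling through a time-varying weight schedule."""
    for t in range(rounds):
        params = consensus_step(params, schedule[t % len(schedule)])
    return params


# Hypothetical setup: 3 agents, 2-dimensional cost parameter vectors.
initial = [[1.0, 0.0], [2.0, 3.0], [6.0, 3.0]]

# Assumed time-varying network: agents 0 and 1 are linked in even rounds,
# agents 1 and 2 in odd rounds. Both matrices are doubly stochastic, so the
# network-wide average of the parameters is preserved at every step.
W_EVEN = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
W_ODD = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.5, 0.5]]

final = run_consensus(initial, [W_EVEN, W_ODD], rounds=60)
for agent, vec in enumerate(final):
    print(agent, [round(v, 6) for v in vec])
```

In this sketch no single round connects all three agents, yet the union of the two link sets does, which is enough for every agent's vector to approach the network average of the initial vectors without any agent revealing more than its current parameters to its neighbors.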
