In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to drive each agent's time-average cost below a pre-specified, agent-specific bound. For the MMDP, we assume that the state dynamics are controlled by the joint actions of the agents, but that each agent's per-stage cost depends only on its own actions. We combine Q-learning, applied to a weighted combination of the agents' costs obtained through a gossip algorithm, with the Metropolis-Hastings or Multiplicative Weights formalisms, which modulate the averaging matrix of the gossip. Our algorithm uses multiple timescales, and we prove that, under mild conditions, it approximately achieves the desired bound for each agent. We also demonstrate the empirical performance of the algorithm in the more general setting of MMDPs with jointly controlled per-stage costs.
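To make the gossip step concrete, the following is a minimal sketch of one way a Metropolis-Hastings rule can shape an averaging matrix so that repeated local averaging converges to a chosen weighted combination of the agents' costs. It is an illustration under stated assumptions, not the construction used in the paper: the function names, the uniform neighbour proposal, the three-agent path graph, and the fixed target weights are all hypothetical, and the Q-learning and multiple-timescale components are omitted.

```python
def metropolis_hastings_matrix(neighbors, target):
    """Build a row-stochastic matrix whose stationary distribution is `target`.

    neighbors: dict mapping each agent to a list of its neighbours
               (assumed undirected and connected; hypothetical input format).
    target:    list of positive weights summing to 1 (assumed given).
    """
    n = len(target)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            # Proposal: pick a neighbour uniformly at random.
            q_ij = 1.0 / len(neighbors[i])
            q_ji = 1.0 / len(neighbors[j])
            # Standard Metropolis-Hastings acceptance probability.
            accept = min(1.0, (target[j] * q_ji) / (target[i] * q_ij))
            P[i][j] = q_ij * accept
        # Rejected proposals keep the agent's own value (self-loop weight).
        P[i][i] = 1.0 - sum(P[i])
    return P


def gossip(P, values, rounds):
    """Each round, every agent replaces its value by a P-weighted average."""
    x = list(values)
    n = len(x)
    for _ in range(rounds):
        x = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical example: three agents on the path graph 0 - 1 - 2.
graph = {0: [1], 1: [0, 2], 2: [1]}
weights = [0.5, 0.3, 0.2]   # assumed target weighting of the agents
costs = [1.0, 2.0, 4.0]     # assumed per-agent costs

P = metropolis_hastings_matrix(graph, weights)
# Every entry approaches 0.5*1.0 + 0.3*2.0 + 0.2*4.0 = 1.9.
print(gossip(P, costs, rounds=200))
```

Because the matrix is row-stochastic with stationary distribution equal to the target weights, every agent's value tends to the same weighted cost using only neighbour-to-neighbour exchanges. In a scheme like the one described above, that common value is the kind of quantity a Q-learning update could then act on.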