The main goal of this paper is to investigate distributed dynamic programming (DP) for solving networked multi-agent Markov decision problems (MDPs). We consider a distributed multi-agent setting in which each agent has access only to its own reward and not to the rewards of other agents. Moreover, each agent can share its parameters with its neighbors over a communication network represented by a graph. We propose a distributed DP algorithm in the continuous-time domain and prove its convergence from a control-theoretic viewpoint. The proposed analysis can be viewed as a preliminary ordinary differential equation (ODE) analysis of a distributed temporal difference learning algorithm, whose convergence can be proved using the Borkar-Meyn theorem and the single time-scale approach.
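To make the setting concrete, the following is a minimal numerical sketch of one possible continuous-time distributed DP for policy evaluation. It is not taken from the paper. It assumes a fixed policy with a known transition matrix `P`, a global objective defined by the average of the agents' rewards, an undirected graph with Laplacian `laplacian`, and a proportional-integral consensus coupling with an auxiliary variable `W`, integrated by forward Euler. All names, matrices, and step sizes are hypothetical choices made for illustration.

```python
# Illustrative sketch only: the coupling law, graph, and numbers are assumptions,
# not the algorithm analyzed in the paper.

def simulate(P, rewards, laplacian, gamma=0.5, step=0.01, iters=5000):
    """Forward-Euler integration of the assumed ODE
         dV_i/dt = r_i + gamma * P V_i - V_i - sum_j L_ij (V_j + W_j)
         dW_i/dt = sum_j L_ij V_j
    Agent i uses only its own reward r_i and its neighbors' V_j, W_j."""
    n_agents, n_states = len(rewards), len(P)
    V = [[0.0] * n_states for _ in range(n_agents)]
    W = [[0.0] * n_states for _ in range(n_agents)]
    for _ in range(iters):
        dV, dW = [], []
        for i in range(n_agents):
            PV = [sum(P[s][t] * V[i][t] for t in range(n_states))
                  for s in range(n_states)]
            LV = [sum(laplacian[i][j] * V[j][s] for j in range(n_agents))
                  for s in range(n_states)]
            LW = [sum(laplacian[i][j] * W[j][s] for j in range(n_agents))
                  for s in range(n_states)]
            dV.append([rewards[i][s] + gamma * PV[s] - V[i][s] - LV[s] - LW[s]
                       for s in range(n_states)])
            dW.append(LV)
        V = [[V[i][s] + step * dV[i][s] for s in range(n_states)]
             for i in range(n_agents)]
        W = [[W[i][s] + step * dW[i][s] for s in range(n_states)]
             for i in range(n_agents)]
    return V


def centralized_value(P, rewards, gamma=0.5, sweeps=200):
    """Reference solution: value iteration on the average reward (centralized)."""
    n_agents, n_states = len(rewards), len(P)
    r_bar = [sum(r[s] for r in rewards) / n_agents for s in range(n_states)]
    v = [0.0] * n_states
    for _ in range(sweeps):
        v = [r_bar[s] + gamma * sum(P[s][t] * v[t] for t in range(n_states))
             for s in range(n_states)]
    return v


# Hypothetical 3-state chain (symmetric, hence doubly stochastic) and 3 agents.
P = [[0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
rewards = [[1.0, 0.0, 0.0],   # agent 0 observes only this row
           [0.0, 2.0, 0.0],   # agent 1
           [0.0, 0.0, 3.0]]   # agent 2
laplacian = [[1, -1, 0],      # path graph 0 - 1 - 2
             [-1, 2, -1],
             [0, -1, 1]]

V = simulate(P, rewards, laplacian)
v_ref = centralized_value(P, rewards)
for i, Vi in enumerate(V):
    print(f"agent {i}:", [round(x, 4) for x in Vi])
print("central:", [round(x, 4) for x in v_ref])
```

In this sketch the integral variable `W` is what lets every local estimate settle on the value function of the averaged reward even though no agent sees the other rewards. A plain Laplacian coupling without `W` would in general leave a steady-state disagreement between agents. The symmetric `P` is chosen so that a simple quadratic Lyapunov argument applies to this toy example; the paper's own conditions may differ.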