This paper investigates distributed dynamic programming (DP) and distributed temporal difference (TD) learning algorithms for networked multi-agent Markov decision problems (MAMDPs). We adopt a distributed multi-agent framework in which each agent has access only to its own reward and has no knowledge of the other agents' rewards. Each agent can, however, share its parameters with neighboring agents over a communication network represented by a graph. Our contributions are twofold: 1) We introduce a novel distributed DP algorithm inspired by the averaging consensus method in the continuous-time domain, and analyze its convergence from a control-theoretic perspective. 2) Building on this distributed DP, we devise a new distributed TD-learning algorithm and prove its convergence. A distinguishing feature of the proposed distributed DP is that it incorporates two independent dynamic systems, each with a distinct role. This structure leads to a novel distributed TD-learning strategy whose convergence can be established directly using the Borkar-Meyn theorem.
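The averaging consensus idea mentioned above can be illustrated with a minimal sketch. This is not the paper's distributed DP or TD algorithm; it only shows the standard continuous-time consensus dynamics, x_i' = sum over neighbors j of (x_j - x_i), discretized with a forward Euler step. The ring graph, the step size, the local values, and the function names `consensus_step` and `run_consensus` are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: Euler-discretized averaging consensus.
# The graph, step size, and local values are assumptions, not from the paper.

def consensus_step(values, neighbors, step):
    """One forward Euler step of x_i' = sum_{j in N(i)} (x_j - x_i)."""
    return [
        x_i + step * sum(values[j] - x_i for j in neighbors[i])
        for i, x_i in enumerate(values)
    ]


def run_consensus(values, neighbors, step=0.1, iterations=500):
    """Repeat the consensus step; each agent only reads its neighbors' values."""
    for _ in range(iterations):
        values = consensus_step(values, neighbors, step)
    return values


# Hypothetical undirected, connected ring graph over 4 agents.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

# Each agent starts with a value only it knows (for example, a local reward).
local_rewards = [1.0, 2.0, 3.0, 6.0]

final = run_consensus(local_rewards, neighbors)
average = sum(local_rewards) / len(local_rewards)
```

With an undirected connected graph and a sufficiently small step size, every entry of `final` approaches `average`, even though no agent ever reads a non-neighbor's value directly. This is the sense in which agents that see only their own rewards can still agree on a network-wide quantity.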