This paper investigates distributed ordinary differential equations (ODEs) and distributed temporal difference (TD) learning algorithms for networked multi-agent Markov decision problems (MAMDPs). We adopt a distributed multi-agent framework in which each agent observes only its own reward and has no access to the rewards of other agents. Each agent can, however, share its parameters with neighboring agents over a communication network represented by a graph. Our contributions are twofold: 1) We introduce novel distributed ODEs inspired by the averaging consensus method in continuous time, and analyze their convergence from a control-theoretic perspective. 2) Building on these ODEs, we devise new distributed TD-learning algorithms. A notable feature of one of the proposed distributed ODEs is that it comprises two independent dynamic systems, each with a distinct role. This structure leads to a novel distributed TD-learning strategy whose convergence can potentially be established using the Borkar-Meyn theorem.
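For intuition about the averaging consensus method mentioned above, the following minimal sketch integrates the textbook continuous-time consensus dynamics dx/dt = -Lx with a forward-Euler step, where L is the Laplacian of the communication graph. It is not the distributed ODE proposed in this paper. The four-agent path graph, the initial values, the step size, and the function names `laplacian` and `simulate_consensus` are all assumptions chosen for illustration.

```python
def laplacian(n, edges):
    """Graph Laplacian L = D - A of an undirected graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L


def simulate_consensus(x0, edges, step=0.05, num_steps=2000):
    """Forward-Euler integration of the consensus ODE dx/dt = -L x.

    Each agent i only uses the values of its graph neighbors, since
    row i of L is nonzero only at i and at the neighbors of i.
    """
    n = len(x0)
    L = laplacian(n, edges)
    x = list(x0)
    for _ in range(num_steps):
        dx = [-sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + step * dx[i] for i in range(n)]
    return x


# Hypothetical example: 4 agents connected in a path 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
x0 = [1.0, 5.0, -2.0, 4.0]  # assumed local values, one per agent
x_final = simulate_consensus(x0, edges)
```

Because the graph is connected and undirected, every entry of `x_final` approaches the average of the initial values (2.0 in this example), even though no agent ever sees the values of non-neighbors. The step size must be small relative to the largest Laplacian eigenvalue for the Euler iteration to remain stable.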