This paper introduces novel Bellman mappings (B-Maps) for value iteration (VI) in distributed reinforcement learning (DRL), where multiple agents operate over a network without a centralized fusion node. Each agent constructs its own nonparametric B-Map for VI while communicating only with its direct neighbors to reach consensus. These B-Maps operate on Q-functions represented in a reproducing kernel Hilbert space, yielding a nonparametric formulation that allows for flexible, agent-specific basis-function design. Unlike existing DRL methods that restrict information exchange to Q-function estimates, the proposed framework also lets agents share basis information in the form of covariance matrices, thereby conveying additional structural detail. A theoretical analysis establishes linear convergence rates for both the Q-function and covariance-matrix estimates toward their consensus values, with the optimal learning rate of the consensus-based updates dictated by the ratio of the smallest positive eigenvalue to the largest eigenvalue of the network's graph Laplacian. Furthermore, each nodal Q-function estimate is shown to lie very close to the fixed point of a centralized nonparametric B-Map, so the proposed DRL design effectively approximates the performance of a centralized fusion center. Numerical experiments on two well-known control problems demonstrate the superior performance of the proposed nonparametric B-Maps over prior methods. Notably, the results reveal a counter-intuitive finding: even though the proposed approach exchanges more information per iteration, by sharing covariance matrices in addition to Q-function estimates, it attains the desired performance at a lower cumulative communication cost than existing DRL schemes, underscoring the crucial role of basis information in accelerating the learning process.
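To make the eigenvalue-ratio claim concrete, the following is a minimal, self-contained sketch of the standard linear consensus iteration x <- x - eps*L*x on a toy four-agent ring. This is not the paper's full B-Map algorithm: the graph, the scalar agent states (standing in for Q-function or covariance-matrix coefficients), and all variable names are illustrative assumptions. It shows the well-known fact, consistent with the abstract's statement, that the step size minimizing the linear contraction factor, and the resulting rate, depend only on the smallest positive and largest eigenvalues of the graph Laplacian.

```python
# Illustrative sketch (not the paper's algorithm): consensus averaging
# x <- x - eps * L x on a toy graph, showing how the optimal step size
# and convergence rate are governed by the Laplacian eigenvalue ratio.
import numpy as np

# Toy undirected network: 4 agents on a ring.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

eigs = np.linalg.eigvalsh(L)
lam_min_pos = eigs[eigs > 1e-10].min()  # smallest positive eigenvalue
lam_max = eigs.max()                    # largest eigenvalue

# For x <- (I - eps*L) x, the error components orthogonal to consensus
# contract by max(|1 - eps*lam_min_pos|, |1 - eps*lam_max|), minimized at:
eps_opt = 2.0 / (lam_min_pos + lam_max)
rate = (lam_max - lam_min_pos) / (lam_max + lam_min_pos)
print(f"optimal step size {eps_opt:.3f}, linear rate {rate:.3f}")

# Each agent holds a local scalar estimate; repeated neighbor-only
# averaging drives all estimates to the network-wide mean.
x = np.array([4.0, -1.0, 3.0, 2.0])
for _ in range(25):
    x = x - eps_opt * (L @ x)
print("consensus values:", np.round(x, 4))  # all close to mean(x0) = 2.0
```

Note that the rate can be rewritten as (1 - r)/(1 + r) with r = lam_min_pos/lam_max, which is exactly the eigenvalue ratio the abstract identifies as controlling the speed of consensus.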