We address real-time sampling and estimation of autoregressive Markovian sources in dynamic yet structurally similar multi-hop wireless networks. Each node caches samples from other nodes and communicates over wireless collision channels, with the goal of minimizing the time-average estimation error via decentralized policies. Because of the high-dimensional action spaces and the complexity of network topologies, deriving optimal policies analytically is intractable. We therefore propose a graphical multi-agent reinforcement learning framework for policy optimization. Theoretically, we show that the proposed policies are transferable: a policy trained on one graph can be applied effectively to structurally similar graphs. Numerical experiments demonstrate that (i) the proposed policy outperforms state-of-the-art baselines; (ii) the trained policies transfer to larger networks, with performance gains that grow with the number of agents; (iii) the graphical training procedure withstands non-stationarity even under independent learning; and (iv) recurrence is pivotal in both independent learning and centralized training with decentralized execution, and improves resilience to non-stationarity.