To improve the efficiency of reinforcement learning (RL), we propose AFedPG, a novel asynchronous federated reinforcement learning (FedRL) framework that constructs a global model through collaboration among $N$ agents using policy gradient (PG) updates. To address the challenge of lagged policies in asynchronous settings, we design a delay-adaptive lookahead technique \textit{specifically for FedRL} that effectively handles heterogeneous arrival times of policy gradients. We analyze the theoretical global convergence bound of AFedPG and characterize the advantage of the proposed algorithm in terms of both sample complexity and time complexity. Specifically, AFedPG achieves an $O(\frac{\epsilon^{-2.5}}{N})$ sample complexity for global convergence at each agent on average. Compared to the single-agent setting with $O(\epsilon^{-2.5})$ sample complexity, it enjoys a linear speedup with respect to the number of agents. Moreover, compared to synchronous FedPG, AFedPG improves the time complexity from $O(\frac{t_{\max}}{N})$ to $O\big(\big(\sum_{i=1}^{N} \frac{1}{t_{i}}\big)^{-1}\big)$, where $t_{i}$ denotes the per-iteration time consumption at agent $i$ and $t_{\max}$ is the largest among them. The latter complexity is never larger than the former, and the improvement becomes significant in large-scale federated settings with heterogeneous computing powers ($t_{\max}\gg t_{\min}$). Finally, we empirically verify the improved performance of AFedPG in four widely used MuJoCo environments with varying numbers of agents. We also demonstrate the advantages of AFedPG in various computing-heterogeneity scenarios.
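The claim that the asynchronous time complexity never exceeds the synchronous one follows from a one-line bound on the harmonic sum; a sketch of the argument, using only the definitions above:

```latex
% Each per-iteration time satisfies t_i \le t_{\max}, hence 1/t_i \ge 1/t_{\max}.
% Summing over the N agents and inverting reverses the inequality:
\left(\sum_{i=1}^{N} \frac{1}{t_i}\right)^{-1}
\;\le\;
\left(\sum_{i=1}^{N} \frac{1}{t_{\max}}\right)^{-1}
\;=\;
\frac{t_{\max}}{N}.
```

Equality holds only when all agents are homogeneous ($t_1 = \dots = t_N = t_{\max}$); the more heterogeneous the per-iteration times, the larger the gap, which is why the improvement is most pronounced when $t_{\max} \gg t_{\min}$.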