In this paper, we consider federated reinforcement learning for tabular episodic Markov Decision Processes (MDPs) where, under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. While linear speedup in the number of agents has been achieved in similar settings for some metrics, such as convergence rate and sample complexity, it remains unclear whether a model-free algorithm can achieve linear regret speedup with low communication cost. We propose two federated Q-learning algorithms, termed FedQ-Hoeffding and FedQ-Bernstein, and show that their total regrets achieve a linear speedup over their single-agent counterparts when the time horizon is sufficiently large, while the communication cost scales logarithmically in the total number of time steps $T$. These results rely on three ingredients: an event-triggered synchronization mechanism between the agents and the server, a novel step size selection used when the server aggregates the local estimates of the state-action values to form the global estimates, and a set of new concentration inequalities that bound the sum of non-martingale differences. This is the first work showing that linear regret speedup and logarithmic communication cost can be achieved by model-free algorithms in federated reinforcement learning.