In this paper, we consider federated reinforcement learning for tabular episodic Markov Decision Processes (MDPs) where, under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. While linear speedup in the number of agents has been achieved in similar settings for some metrics, such as convergence rate and sample complexity, it is unclear whether a model-free algorithm can be designed to achieve linear regret speedup with low communication cost. We propose two federated Q-Learning algorithms, termed FedQ-Hoeffding and FedQ-Bernstein, and show that their total regrets achieve a linear speedup over their single-agent counterparts when the time horizon is sufficiently large, while the communication cost scales logarithmically in the total number of time steps $T$. These results rely on an event-triggered synchronization mechanism between the agents and the server, a novel step size selection used when the server aggregates the local estimates of the state-action values to form the global estimates, and a set of new concentration inequalities that bound the sum of non-martingale differences. This is the first work showing that linear regret speedup and logarithmic communication cost can be achieved by model-free algorithms in federated reinforcement learning.