When the data used for reinforcement learning (RL) are collected by multiple agents in a distributed manner, federated versions of RL algorithms allow collaborative learning without the need to share local data. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning. In both cases, our bounds exhibit a linear speedup with respect to the number of agents and sharper dependencies on other salient problem parameters. Moreover, existing approaches to federated Q-learning adopt an equally weighted averaging of local Q-estimates, which can be highly sub-optimal in the asynchronous setting because different local behavior policies can make the local trajectories highly heterogeneous. The existing sample complexity scales in inverse proportion to the minimum entry of the stationary state-action occupancy distributions over all agents, requiring that every agent cover the entire state-action space. Instead, we propose a novel importance averaging algorithm that gives larger weights to more frequently visited state-action pairs. The improved sample complexity scales in inverse proportion to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus requiring only that the agents collectively cover the entire state-action space, unveiling the blessing of heterogeneity.
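The contrast between equally weighted averaging and importance averaging can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the exact weights, so the sketch assumes weights proportional to each agent's raw visit counts since the last aggregation. The function names, the `fallback` argument, and the toy numbers are all hypothetical.

```python
import numpy as np


def equal_average(local_qs):
    """Equally weighted average of K local Q-tables: (K, S, A) -> (S, A)."""
    return np.mean(np.asarray(local_qs, dtype=float), axis=0)


def importance_average(local_qs, visit_counts, fallback):
    """Average local Q-tables entry-wise, weighted by local visit counts.

    Assumption (not taken from the paper): the weight of agent k on the
    pair (s, a) is its visit count divided by the total count over agents.
    Pairs that no agent visited keep the `fallback` value, for example the
    previous global estimate.
    """
    local_qs = np.asarray(local_qs, dtype=float)
    counts = np.asarray(visit_counts, dtype=float)
    total = counts.sum(axis=0)                      # shape (S, A)
    visited = total > 0
    # Divide only where some agent visited the pair; other weights stay 0.
    weights = np.divide(counts, total, out=np.zeros_like(counts), where=visited)
    averaged = (weights * local_qs).sum(axis=0)
    return np.where(visited, averaged, fallback)


# Toy case: 2 agents, 1 state, 2 actions. Each agent only ever tries one
# action, so its estimate for the other action is still the initial 0.
q_local = np.array([[[1.0, 0.0]],
                    [[0.0, 2.0]]])
counts = np.array([[[10, 0]],
                   [[0, 10]]])
q_prev = np.zeros((1, 2))

q_equal = equal_average(q_local)
q_importance = importance_average(q_local, counts, q_prev)
```

In this toy case neither agent covers both actions, yet together they cover the whole state-action space. Equal averaging dilutes each learned entry with the other agent's untouched initial value, whereas the count-based weights keep the entry from the agent that actually visited the pair.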