We consider the problem of federated Q-learning, in which $M$ agents collaboratively learn the optimal Q-function of an unknown infinite-horizon Markov decision process with finite state and action spaces. We investigate the trade-off between sample and communication complexities for the widely used class of intermittent communication algorithms. We first establish a converse result showing that any federated Q-learning algorithm that achieves a speedup with respect to the number of agents in its per-agent sample complexity must incur a communication cost of at least order $\frac{1}{1-\gamma}$, up to logarithmic factors, where $\gamma$ is the discount factor. We then propose a new algorithm, called Fed-DVR-Q, which is the first federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Together, these results provide a complete characterization of the sample-communication complexity trade-off in federated Q-learning.