Offline reinforcement learning (RL), which seeks to learn an optimal policy using offline data, has garnered significant interest due to its potential in critical applications where online data collection is infeasible or expensive. This work explores the benefit of federated learning for offline RL, with the aim of collaboratively leveraging offline datasets held by multiple agents. Focusing on finite-horizon episodic tabular Markov decision processes (MDPs), we design FedLCB-Q, a variant of the popular model-free Q-learning algorithm tailored for federated offline RL. FedLCB-Q updates local Q-functions at the agents with novel learning rate schedules and aggregates them at a central server using importance averaging and a carefully designed pessimistic penalty term. Our sample complexity analysis reveals that, with appropriately chosen parameters and synchronization schedules, FedLCB-Q achieves linear speedup in the number of agents without requiring a high-quality dataset at any individual agent, as long as the local datasets collectively cover the state-action space visited by the optimal policy. This highlights the power of collaboration in the federated setting. In fact, the sample complexity almost matches that of the single-agent counterpart, as if all the data were stored at a central location, up to polynomial factors of the horizon length. Furthermore, FedLCB-Q is communication-efficient: the number of communication rounds is only linear in the horizon length, up to logarithmic factors.
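To make the server-side aggregation step more concrete, the following is a minimal sketch of a count-weighted average of local Q-estimates with a pessimistic penalty subtracted. It is an illustration under stated assumptions, not the FedLCB-Q update itself: the abstract does not specify the importance weights, the learning rate schedules, or the penalty. Here the weights are assumed to be proportional to local visit counts, the penalty is assumed to be a Hoeffding-style term, rewards are assumed to lie in [0, 1], and the names `aggregate_pessimistic`, `c_b`, and `delta` are hypothetical.

```python
import math


def aggregate_pessimistic(local_q, local_counts, horizon, c_b=1.0, delta=0.05):
    """Aggregate local Q-estimates for a single (state, action) pair.

    local_q      -- one Q-estimate per agent
    local_counts -- number of times each agent visited this pair
    horizon      -- episode length H (Q-values assumed to lie in [0, H])
    c_b, delta   -- hypothetical penalty constant and failure probability
    """
    total = sum(local_counts)
    if total == 0:
        # No agent has data for this pair: return the most pessimistic value.
        return 0.0
    # Assumed "importance averaging": weight each agent by its visit count.
    avg = sum(n * q for q, n in zip(local_q, local_counts)) / total
    # Assumed Hoeffding-style penalty that shrinks as pooled data grows.
    penalty = c_b * horizon * math.sqrt(math.log(1.0 / delta) / total)
    # Clip at zero so the estimate stays a valid lower bound on a value.
    return max(avg - penalty, 0.0)


# Three agents with uneven coverage of the same (state, action) pair.
few = aggregate_pessimistic([0.0, 2.0, 3.0], [0, 10, 90], horizon=5)
many = aggregate_pessimistic([0.0, 2.0, 3.0], [0, 100, 900], horizon=5)
```

In this sketch the penalty depends on the pooled visit count rather than on any single agent's count, so an agent with no data for a state-action pair does not force a pessimistic estimate as long as other agents cover that pair. This mirrors the collective coverage condition described in the abstract, although the actual algorithm may achieve it differently.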