In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. However, interacting with real environments online can be risky and costly, motivating offline federated RL (FRL), where local devices learn from fixed datasets. Despite its promise, offline FRL may break down under low-quality, heterogeneous data: offline RL tends to get stuck in local optima, and in FRL one device's suboptimal policy can degrade the aggregated model, a failure we term policy pollution. We present FORLER, which combines server-side Q-ensemble aggregation with device-side actor rectification. The server robustly merges device Q-functions to curb policy pollution and shifts heavy computation off resource-constrained hardware without compromising privacy. Locally, actor rectification enriches policy gradients via a zeroth-order search for high-Q actions and a tailored regularizer that nudges the policy toward them; a $\delta$-periodic strategy further reduces local computation. We provide theoretical safe-policy-improvement guarantees, and extensive experiments show that FORLER consistently outperforms strong baselines under varying data quality and heterogeneity.
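The abstract only says the server "robustly merges" device Q-functions; as a minimal sketch of one plausible robust rule, the snippet below aggregates per-(s, a) Q-estimates pessimistically as mean minus a disagreement penalty. The function name `aggregate_q_ensemble` and the weight `beta` are hypothetical illustrations, not the paper's actual rule.

```python
import numpy as np

def aggregate_q_ensemble(q_estimates: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Pessimistically merge per-device Q-estimates for a shared (s, a) batch.

    q_estimates: shape (n_devices, batch_size), each row one device's Q-values.
    beta:        strength of the cross-device disagreement penalty (assumed).
    Returns one aggregated Q-value per (s, a) pair.
    """
    mean_q = q_estimates.mean(axis=0)
    std_q = q_estimates.std(axis=0)
    # Penalizing disagreement downweights outlier estimates, limiting how much
    # a single polluted local policy can pull the aggregated target.
    return mean_q - beta * std_q
```

Under this reading, a device whose Q-function disagrees sharply with its peers contributes less to the merged target, which is one way robust aggregation can curb policy pollution.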
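For the device-side step, a minimal sketch of how a zeroth-order search plus a rectification regularizer could look is given below, assuming a Gaussian-perturbation candidate search around the policy's action; the names `zeroth_order_rectify` and `rectification_loss`, the sample count, and the noise scale are all assumptions for illustration.

```python
import numpy as np

def zeroth_order_rectify(q_fn, state, pi_action, n_samples=16, sigma=0.1, rng=None):
    """Gradient-free search for a higher-Q action near the policy's proposal.

    q_fn:      callable (state, action) -> scalar Q-estimate.
    pi_action: action proposed by the current policy (np.ndarray).
    Returns the highest-Q candidate (falls back to pi_action itself).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Candidate set: the policy action plus Gaussian perturbations of it.
    candidates = [pi_action] + [
        pi_action + sigma * rng.standard_normal(pi_action.shape)
        for _ in range(n_samples)
    ]
    q_values = [q_fn(state, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]

def rectification_loss(pi_action, target_action, weight=1.0):
    """Regularizer nudging the policy output toward the high-Q action found."""
    return weight * float(np.mean((pi_action - target_action) ** 2))
```

In an actual actor update, the rectified action would serve as a fixed (no-gradient) target added to the actor objective, and consistent with the abstract's $\delta$-periodic strategy, the search would run only every $\delta$-th local step to save device computation.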