We study problems of federated control in Markov Decision Processes (MDPs). To solve an MDP with a large state space, multiple learning agents collaboratively learn its optimal policy without communicating their locally collected experience. In our setting, these agents have limited capabilities: each is restricted to a different region of the overall state space during training. To handle the differences among these restricted regions, we first introduce the concept of leakage probabilities to understand how such heterogeneity affects the learning process. We then propose a novel communication protocol, the Federated-Q protocol (FedQ), which periodically aggregates the agents' knowledge of their restricted regions and modifies their learning problems accordingly for further training. On the theoretical side, we justify the correctness of FedQ as a communication protocol, give a general result on the sample complexity of the derived algorithms FedQ-X, which are built on an RL oracle, and conduct a thorough study of the sample complexity of FedQ-SynQ. Specifically, we show that FedQ-X enjoys a linear speedup in sample complexity when the workload is uniformly distributed among agents. Finally, we carry out experiments in various environments to demonstrate the efficiency of our methods.
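To make the idea of periodic aggregation over restricted regions concrete, the following is a minimal sketch and not the FedQ protocol itself. Everything in it is an assumption made for illustration: the deterministic chain MDP, the helper names (`chain_mdp`, `local_update`, `federated_rounds`, `value_iteration`), the use of exact synchronous Bellman backups in place of sampling, and the aggregation rule that simply copies each agent's Q-values for the states it owns. A transition that leaves an agent's region (a "leak") bootstraps from the value stored at the last aggregation.

```python
def chain_mdp(n_states):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right."""
    nxt = [[max(s - 1, 0), min(s + 1, n_states - 1)] for s in range(n_states)]
    rew = [[0.0, 0.0] for _ in range(n_states)]
    rew[n_states - 1][1] = 1.0  # reward for pushing right at the right end
    return nxt, rew


def local_update(q_global, region, nxt, rew, gamma, local_steps):
    """One agent's local training: Bellman backups only on the states it owns."""
    q = [row[:] for row in q_global]  # start from the last aggregate
    for _ in range(local_steps):
        new_q = [row[:] for row in q]
        for s in region:
            for a in (0, 1):
                s2 = nxt[s][a]
                # Rows outside the region are never updated locally, so a
                # transition that leaks out of the region bootstraps from the
                # value fixed at the last aggregation.
                new_q[s][a] = rew[s][a] + gamma * max(q[s2])
        q = new_q
    return q


def federated_rounds(n_states, regions, gamma, rounds, local_steps):
    """Alternate local training with aggregation of region-wise Q-values."""
    nxt, rew = chain_mdp(n_states)
    q_global = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(rounds):
        local_qs = [
            local_update(q_global, region, nxt, rew, gamma, local_steps)
            for region in regions
        ]
        # Aggregation (assumed rule): each agent contributes its own region.
        for region, q in zip(regions, local_qs):
            for s in region:
                q_global[s] = q[s][:]
    return q_global


def value_iteration(n_states, gamma, iters):
    """Centralized reference solution on the same chain."""
    nxt, rew = chain_mdp(n_states)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(iters):
        q = [
            [rew[s][a] + gamma * max(q[nxt[s][a]]) for a in (0, 1)]
            for s in range(n_states)
        ]
    return q
```

For example, `federated_rounds(8, [[0, 1, 2, 3], [4, 5, 6, 7]], 0.9, 200, 5)` splits an eight-state chain between two agents. In this toy setting the aggregated table approaches the one computed by `value_iteration`, because every round applies at least one full Bellman backup to each state. The sketch says nothing about sample complexity, which is the focus of the analysis described above.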