The development of Policy Iteration (PI) has inspired many recent Reinforcement Learning (RL) algorithms, including several policy gradient methods that combine theoretical soundness with empirical success on a variety of tasks. The theory of PI is well developed for centralized learning, but its study in the federated setting is still in its infancy. This paper investigates the federated version of Approximate PI (API) and derives its error bound, taking into account the approximation error introduced by environment heterogeneity. We prove that a proper client selection scheme can reduce this error bound. Based on this theoretical result, we propose a client selection algorithm that alleviates the additional approximation error caused by environment heterogeneity. Experimental results show that the proposed algorithm outperforms other biased and unbiased client selection methods on the federated mountain car problem and the MuJoCo Hopper problem by effectively selecting clients with a lower level of heterogeneity relative to the population distribution.