The development of Policy Iteration (PI) has inspired many recent Reinforcement Learning (RL) algorithms, including several policy gradient methods, that have achieved both theoretical soundness and empirical success on a variety of tasks. The theory of PI is rich in the centralized learning setting, but its study in the federated setting is still in its infancy. This paper explores the federated version of Approximate PI (API) and derives its error bound, taking into account the approximation error introduced by environment heterogeneity. We prove that a proper client selection scheme can reduce this error bound. Based on this theoretical result, we propose a client selection algorithm to alleviate the additional approximation error caused by environment heterogeneity. Experimental results show that the proposed algorithm outperforms other biased and unbiased client selection methods on the federated mountain car problem by effectively selecting clients with a lower level of heterogeneity relative to the population distribution.