The development of Policy Iteration (PI) has inspired many recent Reinforcement Learning (RL) algorithms, including several policy gradient methods, that have achieved both theoretical soundness and empirical success on a variety of tasks. The theory of PI is rich in the context of centralized learning, but its study in the federated setting is still in its infancy. This paper explores the federated version of Approximate PI (API) and derives its error bound, taking into account the approximation error introduced by environment heterogeneity. We prove that a proper client selection scheme can reduce this error bound. Based on this theoretical result, we propose a client selection algorithm to alleviate the additional approximation error caused by environment heterogeneity. Experimental results show that the proposed algorithm outperforms other biased and unbiased client selection methods on the federated mountain car problem by effectively selecting clients with a lower level of heterogeneity relative to the population distribution.