We propose a novel hierarchical Bayesian approach to Federated Learning (FL). Our model describes the generative process of clients' local data through hierarchical Bayesian modeling: each client's local model is a random variable governed by a higher-level global variate. Variational inference in this model leads to an optimisation problem whose block-coordinate descent solution is a distributed algorithm that is separable over clients and never requires them to reveal their private data, making it fully compatible with FL. The block-coordinate algorithm takes particular forms that subsume well-known FL algorithms, including Fed-Avg and Fed-Prox, as special cases. Beyond the modeling and derivations, we provide a convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at a rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD. We also provide a generalisation error analysis proving that the test error of our model on unseen data vanishes as the training data size increases, so the model is asymptotically optimal.
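To make the block-coordinate structure concrete, the following is a minimal sketch of the alternating pattern the abstract describes. It is not the paper's variational algorithm: it is a simplified point-estimate analogue, assuming scalar models and a toy quadratic local loss $f_i(w) = \frac{1}{2}(w - c_i)^2$. The names `local_update`, `server_update`, `run_rounds`, the penalty weight `mu`, and the targets `c_i` are all hypothetical and chosen for illustration. Setting `mu = 0` gives a Fed-Avg-style round (local training followed by averaging), while `mu > 0` adds a Fed-Prox-style proximal term that pulls each local model towards the global one.

```python
from typing import List


def local_update(global_w: float, target: float, mu: float,
                 lr: float = 0.1, steps: int = 5) -> float:
    """Client block: a few gradient steps on f_i(w) + (mu / 2) * (w - global_w)^2.

    Assumption: f_i(w) = 0.5 * (w - target)^2 stands in for the client's
    private loss. Only the resulting weight leaves the client, never the data.
    """
    w = global_w  # start from the current global model
    for _ in range(steps):
        grad = (w - target) + mu * (w - global_w)
        w -= lr * grad
    return w


def server_update(local_ws: List[float]) -> float:
    """Global block: with the local models fixed, average them."""
    return sum(local_ws) / len(local_ws)


def run_rounds(targets: List[float], mu: float, rounds: int = 50) -> float:
    """Alternate the client blocks and the global block for a number of rounds."""
    global_w = 0.0
    for _ in range(rounds):
        local_ws = [local_update(global_w, c, mu) for c in targets]
        global_w = server_update(local_ws)
    return global_w


if __name__ == "__main__":
    client_targets = [1.0, 2.0, 6.0]  # hypothetical per-client optima
    print(run_rounds(client_targets, mu=0.0))  # Fed-Avg-style rounds
    print(run_rounds(client_targets, mu=1.0))  # Fed-Prox-style rounds
```

In this toy setting both variants drive the global model towards the mean of the client optima. The proximal weight changes only how far each client drifts from the global model within a round.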