In this paper, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. ReBoot refits a new model to mini-batches of bootstrap samples that are continuously drawn from each of the locally fitted models. It requires only one round of communication of model parameters and little memory. Theoretically, we analyze the statistical error rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. In both cases, ReBoot provably achieves the full-sample statistical rate. In particular, we show that the systematic bias of ReBoot, that is, the error that is independent of the number of subsamples (i.e., the number of sites), is $O(n^{-2})$ in GLM, where $n$ is the subsample size (the sample size of each local site). This rate is sharper than that of model parameter averaging and its variants, implying that ReBoot tolerates a larger number of data splits while maintaining the full-sample rate. Our simulation study demonstrates the statistical advantage of ReBoot over competing methods. Finally, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks for image classification. FedReBoot substantially outperforms Federated Averaging (FedAvg) within the early rounds of communication.
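To make the refitting step concrete, the following is a minimal sketch of the idea for a linear model, which is one instance of a GLM. It is an illustration under stated assumptions, not the implementation analyzed in the paper: the features are assumed to be standard Gaussian, the noise standard deviation is assumed known, and the refit is carried out by accumulating the normal equations over mini-batches so that no bootstrap sample needs to be stored. The function names `local_ols`, `reboot_linear`, and `demo`, and all parameter values, are hypothetical.

```python
import numpy as np


def local_ols(X, y):
    """Least-squares fit on the data held by a single site."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def reboot_linear(local_betas, n_boot_per_site, batch_size, rng, noise_sd=1.0):
    """Refit one linear model to bootstrap samples drawn from each local fit.

    Assumptions (not taken from the source): features ~ N(0, I), Gaussian
    noise with known standard deviation, and a least-squares refit computed
    from running sums X^T X and X^T y over mini-batches.
    """
    p = local_betas[0].shape[0]
    xtx = np.zeros((p, p))
    xty = np.zeros(p)
    for beta_k in local_betas:
        drawn = 0
        while drawn < n_boot_per_site:
            m = min(batch_size, n_boot_per_site - drawn)
            # Draw a mini-batch of bootstrap samples from the k-th fitted model.
            Xb = rng.standard_normal((m, p))
            yb = Xb @ beta_k + noise_sd * rng.standard_normal(m)
            # Update the sufficient statistics, then discard the mini-batch.
            xtx += Xb.T @ Xb
            xty += Xb.T @ yb
            drawn += m
    # Solve the pooled normal equations for the refitted parameter.
    return np.linalg.solve(xtx, xty)


def demo(seed=0):
    """Simulate K sites, fit locally, and aggregate with the sketch above."""
    rng = np.random.default_rng(seed)
    p, n, K = 5, 200, 10  # dimension, local sample size, number of sites
    beta_true = np.arange(1, p + 1, dtype=float)
    local_betas = []
    for _ in range(K):
        X = rng.standard_normal((n, p))
        y = X @ beta_true + rng.standard_normal(n)
        local_betas.append(local_ols(X, y))
    # Only the K parameter vectors are communicated to the central server.
    beta_reboot = reboot_linear(local_betas, n_boot_per_site=5000,
                                batch_size=500, rng=rng)
    return beta_true, beta_reboot


if __name__ == "__main__":
    beta_true, beta_reboot = demo()
    print("max abs error:", np.max(np.abs(beta_reboot - beta_true)))
```

In this sketch the central server receives only the local parameter vectors, which corresponds to the single round of communication described above, and its memory use is governed by `batch_size` rather than by the total number of bootstrap samples.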