Federated Learning (FL) has emerged as a practical approach to training a model from decentralized data. The proliferation of FL has led to the development of numerous FL algorithms and mechanisms. Many prior efforts focus primarily on the accuracy of these approaches, but little is understood about other aspects such as computational overhead, communication cost, and training stability. To bridge this gap, we conduct an extensive performance evaluation of several canonical FL algorithms (FedAvg, FedProx, FedYogi, FedAdam, SCAFFOLD, and FedDyn) using Flame, an open-source federated learning framework. Our comprehensive measurement study reveals that no single algorithm works best across all performance metrics. A few key observations are: (1) While some state-of-the-art algorithms achieve higher accuracy than others, they incur either higher computation overhead (FedDyn) or higher communication overhead (SCAFFOLD). (2) Recent algorithms exhibit a smaller standard deviation in accuracy across clients than FedAvg, indicating that their performance is more stable. (3) However, algorithms such as FedDyn and SCAFFOLD are more prone to catastrophic failures without the support of additional techniques such as gradient clipping. We hope that our empirical study can help the community build best practices for evaluating FL algorithms.
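To make the setup concrete, the sketch below shows a FedAvg-style round with optional gradient clipping, the stabilization technique referenced in observation (3). This is a minimal illustration in PyTorch, not Flame's actual API; the model, data loaders, and hyperparameters are illustrative assumptions rather than the paper's experimental configuration.

```python
# Minimal sketch of one FedAvg round with optional gradient clipping.
# Not Flame's API; loaders/hyperparameters are hypothetical placeholders.
import copy
import torch
import torch.nn.functional as F

def local_update(global_model, loader, lr=0.01, epochs=1, clip_norm=None):
    """Train a copy of the global model on one client's local data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            if clip_norm is not None:
                # Clipping bounds the update norm; observation (3) notes that
                # FedDyn/SCAFFOLD can fail catastrophically without it.
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            opt.step()
    return model.state_dict(), len(loader.dataset)

def fedavg_aggregate(client_states, client_sizes):
    """Server-side FedAvg: average client weights, weighted by dataset size."""
    total = sum(client_sizes)
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in client_states[0].items()}
    for state, n in zip(client_states, client_sizes):
        for k, v in state.items():
            avg[k] += v.float() * (n / total)
    return avg
```

The per-client accuracies produced after such rounds are what observation (2) summarizes via their standard deviation across clients.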