Classic machine learning techniques require training on data gathered in a single data lake. However, aggregating data from different owners is not always feasible, for reasons that include security, privacy and secrecy. Data carry a value that might vanish when shared with others. Being able to train without sharing the data enables industrial applications where security and privacy are of paramount importance: global models can be trained by implementing only local policies, which can be run independently and even in air-gapped data centres. Federated Learning (FL) is a distributed machine learning approach that has emerged as an effective way to address privacy concerns by sharing only local AI models while keeping the data decentralized. Two critical challenges of Federated Learning are managing heterogeneous systems within the same federated network and dealing with real data, which are often not independently and identically distributed (non-IID) among the clients. In this paper, we focus on the second problem, i.e., the statistical heterogeneity of the data in the same federated network. In this setting, local models might stray far from the local optimum of the complete dataset, possibly hindering the convergence of the federated model. Several Federated Learning algorithms aimed at tackling the non-IID setting have already been proposed, such as FedAvg, FedProx and Federated Curvature (FedCurv). This work provides an empirical assessment of the behaviour of FedAvg and FedCurv in common non-IID scenarios. Results show that the number of epochs per round is an important hyper-parameter that, when tuned appropriately, can lead to significant performance gains while reducing the communication cost. As a side product of this work, we release the non-IID versions of the datasets we used, so as to facilitate further comparisons by the FL community.
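To make the role of the epochs-per-round hyper-parameter concrete, the following is a minimal sketch of FedAvg-style training on a label-skewed partition. It is not the code used in this work: the least-squares model, the synthetic data, and the names `split_by_label`, `local_train` and `fedavg` are all assumptions made for illustration. Because the toy data are noise-free, every client shares the same optimum, so the sketch shows only where the parameter enters the loop and does not reproduce the drift seen on real non-IID data.

```python
import numpy as np

def split_by_label(labels, num_clients):
    """Label-skew partition: sort sample indices by label, then cut them
    into contiguous shards, so each client sees mostly one class."""
    order = np.argsort(labels, kind="stable")
    return np.array_split(order, num_clients)

def local_train(w, X, y, epochs, lr):
    """Run `epochs` full-batch gradient steps of least-squares regression,
    starting from the current global weights."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg(X, y, shards, rounds, epochs, lr=0.1):
    """Each round: every client trains locally, then the server averages
    the local weights, weighted by the number of samples per client."""
    w = np.zeros(X.shape[1])
    sizes = np.array([len(idx) for idx in shards], dtype=float)
    for _ in range(rounds):
        local = [local_train(w, X[idx], y[idx], epochs, lr) for idx in shards]
        w = np.average(local, axis=0, weights=sizes)
    return w

# Hypothetical toy data: a noise-free linear target, with the sign of the
# target used as a class label to drive the label-skewed split.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
labels = (y > 0).astype(int)

shards = split_by_label(labels, num_clients=2)
w = fedavg(X, y, shards, rounds=50, epochs=5)
print(np.round(w, 3))
```

Raising `epochs` while lowering `rounds` keeps the total number of local gradient steps fixed but reduces the number of model exchanges, which is the trade-off between accuracy and communication cost that the abstract refers to.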