This paper presents the design and implementation of FLIPS, a middleware system that manages data and participant heterogeneity in federated learning (FL) training workloads. In particular, we examine the benefits of label distribution clustering for participant selection in FL. FLIPS clusters the parties involved in an FL training job a priori, based on the label distribution of their data, and during FL training ensures that each cluster is equitably represented among the selected participants. FLIPS supports the most common FL algorithms, including FedAvg, FedProx, FedDyn, FedOpt and FedYogi. To manage platform heterogeneity and dynamic resource availability, FLIPS incorporates a straggler management mechanism that handles changing capacities in distributed, smart community applications. The privacy of label distributions, clustering and participant selection is ensured through a trusted execution environment (TEE). Our comprehensive empirical evaluation compares FLIPS with random participant selection and with two other "smart" selection mechanisms, Oort and gradient clustering, using two real-world datasets, two different non-IID distributions and three common FL algorithms (FedYogi, FedProx and FedAvg). We demonstrate that FLIPS significantly improves convergence, achieving 17-20% higher accuracy with 20-60% lower communication costs, and that these benefits endure in the presence of straggler participants.
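The selection idea described above can be illustrated with a minimal sketch. It is not the FLIPS implementation: the abstract does not name a clustering algorithm, so plain k-means is assumed here, and every function name (`label_distribution`, `kmeans`, `select_participants`) is hypothetical. TEE protection and straggler handling are left out entirely.

```python
import random
from collections import Counter, defaultdict


def label_distribution(labels, num_classes):
    """Normalized label histogram for one party's local data."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(num_classes)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns a cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_participants(assign, num_select, rng):
    """Pick parties round-robin across clusters so each is equitably represented."""
    clusters = defaultdict(list)
    for party, c in enumerate(assign):
        clusters[c].append(party)
    # Shuffle within each cluster so the pick inside a cluster is random.
    pools = [rng.sample(m, len(m)) for _, m in sorted(clusters.items())]
    chosen = []
    while len(chosen) < num_select and any(pools):
        for pool in pools:
            if pool and len(chosen) < num_select:
                chosen.append(pool.pop())
    return chosen


# Six hypothetical parties, two classes: three skewed to class 0, three to class 1.
party_labels = [
    [0] * 10, [0] * 9 + [1], [0] * 8 + [1] * 2,
    [1] * 10, [1] * 9 + [0], [1] * 8 + [0] * 2,
]
dists = [label_distribution(lbls, num_classes=2) for lbls in party_labels]
assign = kmeans(dists, k=2)
chosen = select_participants(assign, num_select=4, rng=random.Random(1))
```

In this toy setup, a round of four participants draws two parties from each label-distribution cluster, whereas uniform random selection could draw all four from parties dominated by the same class.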