Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning that preserves data privacy. As the method evolved and was confronted with various real-world scenarios, new challenges emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups within which the data distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested on only a few cases of data heterogeneity, without systematic justification of these choices. Further, the taxonomy used to differentiate heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneities in FL. We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performance and data heterogeneity scenarios.