Over recent years, Federated Learning (FL) has proven to be one of the most promising methods of distributed learning that preserves data privacy. As the method evolved and was confronted with various real-world scenarios, new challenges emerged. One such challenge is the presence of highly heterogeneous (often referred to as non-IID) data distributions among participants of the FL protocol. A popular solution to this hurdle is Clustered Federated Learning (CFL), which aims to partition clients into groups within which the data distributions are homogeneous. In the literature, state-of-the-art CFL algorithms are often tested on only a few cases of data heterogeneity, without systematic justification of those choices. Further, the taxonomy used to differentiate heterogeneity scenarios is not always straightforward. In this paper, we explore the performance of two state-of-the-art CFL algorithms with respect to a proposed taxonomy of data heterogeneities in FL. We work with three image classification datasets and analyze the resulting clusters against the heterogeneity classes using extrinsic clustering metrics. Our objective is to provide a clearer understanding of the relationship between CFL performance and data heterogeneity scenarios.