Fair Federated Data Clustering through Personalization: Bridging the Gap between Diverse Data Distributions

The rapid growth of data from edge devices has catalyzed the performance of machine learning algorithms. However, the generated data resides on client devices, so traditional machine learning paradigms face two major challenges: the data must be centralized for training, and most of the generated data lacks class labels, while clients have little incentive to label it manually owing to the high cost and lack of expertise. To overcome these issues, there have been initial attempts to handle unlabelled data in a privacy-preserving, distributed manner using unsupervised federated data clustering. The goal is to partition the data available on clients into $k$ partitions (called clusters) without any actual exchange of data. Most existing algorithms are either highly dependent on data distribution patterns across clients or computationally expensive. Furthermore, because data is skewed across clients in most practical scenarios, existing models might leave some clients with a high clustering cost, making them reluctant to participate in the federated process. To address this, we are the first to introduce the idea of personalization in federated clustering. The goal is to strike a balance between achieving a lower clustering cost and achieving a uniform cost across clients. We propose p-FClus, which addresses these goals in a single round of communication between the server and the clients. We validate the efficacy of p-FClus on a variety of federated datasets, showcasing its data-independent nature and its applicability to any finite $\ell$-norm, while simultaneously achieving lower cost and variance.
