Federated Learning (FL) is a pivotal approach in decentralized machine learning, especially when data privacy is crucial and direct data sharing is impractical. While FL is typically associated with supervised learning, its potential in unsupervised scenarios remains underexplored. This paper introduces a novel unsupervised federated learning methodology designed to identify the complete set of categories (global K) across multiple clients holding label-free, non-uniformly distributed data, a process known as Federated Clustering. Our approach, Federated Cluster-Wise Refinement (FedCRef), has clients collaboratively train models on clusters with similar data distributions. Initially, clients with diverse local data distributions (local K) train models on their own clusters to generate compressed data representations. These local models are then shared across the network, enabling clients to compare them through reconstruction error analysis, which leads to the formation of federated groups. In these groups, clients collaboratively train a shared model representing each data distribution, while continuously refining their local clusters to improve data association accuracy. This iterative process allows our system to identify all potential data distributions across the network and to develop robust representation models for each one. To validate our approach, we compare it with traditional centralized methods, establishing a performance baseline and showcasing the advantages of our distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, demonstrating that FedCRef refines and aligns cluster models with actual data distributions, significantly improving data representation precision in unsupervised federated settings.
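To make the group-formation step concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a client could cross-evaluate received cluster models via reconstruction error and propose federated groups. A PCA-style linear autoencoder stands in for the actual per-cluster models, and all names (`train_linear_autoencoder`, `propose_groups`, the `threshold` value) are assumptions introduced here for illustration only.

```python
# Illustrative sketch: pairing local clusters into federated groups by
# cross-evaluating reconstruction error. A linear (PCA-style) autoencoder
# stands in for each per-cluster model; the real system may use any model
# that yields a reconstruction error.
import numpy as np

def train_linear_autoencoder(X, latent_dim=8):
    """Fit a PCA-style linear autoencoder: keep the top principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:latent_dim]          # (latent_dim, n_features)
    return {"mean": mean, "components": components}

def reconstruction_error(model, X):
    """Mean squared error between X and its reconstruction under the model."""
    Z = (X - model["mean"]) @ model["components"].T   # encode
    X_hat = Z @ model["components"] + model["mean"]   # decode
    return float(np.mean((X - X_hat) ** 2))

def propose_groups(local_clusters, shared_models, threshold):
    """
    local_clusters: {cluster_id: data array} held by this client.
    shared_models:  {(client_id, cluster_id): model} received from the network.
    Returns, per local cluster, the remote (client, cluster) pairs whose model
    reconstructs it below the threshold, i.e. candidates for a federated group.
    """
    matches = {}
    for cid, X in local_clusters.items():
        matches[cid] = [
            key for key, model in shared_models.items()
            if reconstruction_error(model, X) < threshold
        ]
    return matches

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "distributions"; each client sees one cluster from each.
    dist_a = lambda n: rng.normal(0.0, 1.0, size=(n, 20))
    dist_b = lambda n: rng.normal(5.0, 1.0, size=(n, 20))

    client1 = {"c1_0": dist_a(200), "c1_1": dist_b(200)}
    client2 = {"c2_0": dist_b(200), "c2_1": dist_a(200)}

    # Client 2 trains one model per local cluster and shares them.
    shared = {("client2", k): train_linear_autoencoder(X) for k, X in client2.items()}

    # Client 1 checks which received models explain its own clusters well.
    print(propose_groups(client1, shared, threshold=1.5))
```

In this sketch, clusters drawn from the same underlying distribution yield a low cross-reconstruction error and are flagged as candidates for joint training, while mismatched clusters are rejected; the threshold is a hypothetical tuning parameter, not a value taken from the paper.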