Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions

Federated Learning (FL) is a key approach to decentralized machine learning when data privacy is crucial and direct data sharing is impractical. Although FL is usually associated with supervised learning, its potential in unsupervised settings remains underexplored. This paper introduces an unsupervised federated learning method that identifies the complete set of categories (global K) across multiple clients holding label-free, non-uniformly distributed data, a task known as Federated Clustering. In our approach, Federated Cluster-Wise Refinement (FedCRef), clients collaboratively train models on clusters with similar data distributions. Initially, clients with differing local data distributions (local K) train a model on each of their clusters to produce compressed data representations. These local models are then shared across the network, and clients compare them through reconstruction error analysis to form federated groups. Within each group, clients collaboratively train a shared model that represents the group's data distribution while continuously refining their local clusters to improve the accuracy of data association. Through this iterative process, the system identifies all potential data distributions across the network and develops a robust representation model for each. To validate the approach, we compare it with traditional centralized methods, establishing a performance baseline and showing the advantages of the distributed solution. We also conduct experiments on the EMNIST and KMNIST datasets, which demonstrate FedCRef's ability to refine cluster models and align them with the actual data distributions, significantly improving the precision of data representation in unsupervised federated settings.
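
The grouping step described above (share per-cluster models, compare them by reconstruction error, form federated groups) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: a PCA-style linear autoencoder stands in for whatever representation model the clients actually train, each client is assumed to have already partitioned its data into local clusters, and the function names and the `tolerance` rule (two clusters are grouped when each model reconstructs the other cluster's data almost as well as its own) are hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X, n_components=2):
    """Fit a PCA-style linear autoencoder (a stand-in model, not the paper's)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(model, X):
    """Mean squared reconstruction error of X under a (mean, components) model."""
    mean, components = model
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return float(np.mean(np.sum((centered - reconstructed) ** 2, axis=1)))

def form_groups(models, clusters, tolerance=3.0):
    """Assign a group label to every local cluster (hypothetical rule).

    Clusters i and j are merged when model i reconstructs cluster j, and
    model j reconstructs cluster i, within `tolerance` times the error
    each model achieves on its own cluster.
    """
    n = len(models)
    own = [reconstruction_error(models[i], clusters[i]) for i in range(n)]
    group_of = list(range(n))  # every cluster starts in its own group
    for i in range(n):
        for j in range(i + 1, n):
            cross_ij = reconstruction_error(models[i], clusters[j])
            cross_ji = reconstruction_error(models[j], clusters[i])
            if cross_ij <= tolerance * own[i] and cross_ji <= tolerance * own[j]:
                old, new = group_of[j], group_of[i]
                group_of = [new if g == old else g for g in group_of]
    return group_of

def sample_distribution(rng, basis, offset, n=200, noise=0.1):
    """Draw n noisy points near a low-dimensional affine subspace."""
    latent = rng.normal(size=(n, basis.shape[0]))
    return latent @ basis + offset + noise * rng.normal(size=(n, basis.shape[1]))

rng = np.random.default_rng(0)
basis_a, basis_b = rng.normal(size=(2, 10)), rng.normal(size=(2, 10))
offset_a, offset_b = np.zeros(10), 5.0 * np.ones(10)

# Synthetic local clusters: client 1 holds A and B, client 2 holds A, client 3 holds B.
clusters = [
    sample_distribution(rng, basis_a, offset_a),
    sample_distribution(rng, basis_b, offset_b),
    sample_distribution(rng, basis_a, offset_a),
    sample_distribution(rng, basis_b, offset_b),
]
models = [fit_linear_autoencoder(X) for X in clusters]  # only models are "shared"
groups = form_groups(models, clusters)
print(groups)
```

In this toy setup, the two clusters drawn from distribution A should receive one label and the two drawn from distribution B another, so the number of distinct labels plays the role of the global K. The collaborative training of a shared model within each group and the iterative refinement of local clusters are not sketched here.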
