Federated Clustering (FC) is an emerging and promising approach for exploring data distribution patterns in distributed, privacy-protected data in an unsupervised manner. Existing FC methods implicitly assume that client data contain a known number of uniformly sized clusters. In real scenarios, however, the true number of clusters is typically unknown and cluster sizes are naturally imbalanced. Furthermore, the privacy-preserving transmission constraints of federated learning inevitably reduce the usable information, which makes developing robust and accurate FC extremely challenging. To address this, we propose a novel FC framework named Fed-$k^*$-HC, which automatically determines an optimal number of clusters $k^*$ from the data distribution explored through hierarchical clustering. To obtain the global data distribution needed to determine $k^*$, each client generates micro-subclusters and uploads their prototypes to the server for hierarchical merging. The density-based merging design allows clusters of varying sizes and shapes to be explored, and the progressive merging process self-terminates according to the neighboring relationships among the prototypes, which yields $k^*$. Extensive experiments on diverse datasets demonstrate that Fed-$k^*$-HC accurately identifies an appropriate number of clusters.
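The client-to-server flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how micro-subclusters are formed or what the exact termination rule is, so everything here is an assumption made for illustration. The hypothetical `client_prototypes` uses plain k-means to form micro-subclusters, and the hypothetical `server_merge` uses single-linkage merging that stops once the closest remaining pair of prototypes is farther apart than `scale` times the median nearest-neighbor distance. This stopping rule is a simple stand-in for the paper's density-based, neighbor-driven criterion.

```python
import numpy as np


def client_prototypes(X, n_micro, n_iter=20, seed=0):
    """Client side (assumed): summarize local data as micro-subcluster centroids.

    Plain Lloyd-style k-means; only the returned centroids would leave the client.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_micro, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_micro):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a subcluster empties
                centers[j] = members.mean(axis=0)
    return centers


def server_merge(prototypes, scale=2.0):
    """Server side (assumed): merge prototypes progressively, closest pair first.

    Merging stops by itself once the next closest pair is farther apart than
    `scale` times the median nearest-neighbor distance. The number of groups
    left at that point is returned as the estimate of k*.
    """
    P = np.asarray(prototypes, dtype=float)
    n = len(P)
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    threshold = scale * np.median(dists.min(axis=1))

    parent = list(range(n))  # union-find forest over prototypes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    rows, cols = np.triu_indices(n, k=1)
    for idx in np.argsort(dists[rows, cols]):
        i, j = int(rows[idx]), int(cols[idx])
        if dists[i, j] > threshold:
            break  # self-termination: no remaining pair counts as neighbors
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    roots = [find(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    labels = np.array([ids[r] for r in roots])
    return labels, len(ids)
```

In use, each client would call `client_prototypes` on its local data, the server would stack the uploaded centroids with `np.vstack` and pass them to `server_merge`, and the second return value would serve as the estimated number of clusters. Because only centroids are transmitted, the server never sees raw samples, though this sketch makes no formal privacy guarantee.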