In this paper, we investigate the federated clustering (FedC) problem, which aims to accurately partition unlabeled data samples distributed over a massive number of clients into a finite number of clusters under the orchestration of a parameter server, while taking data privacy into account. Although FedC is an NP-hard optimization problem involving real variables that denote the cluster centroids and binary variables that denote the cluster membership of each data sample, we reformulate it as a non-convex optimization problem with only one convex constraint, which yields a soft clustering solution. We then propose a novel FedC algorithm based on differential privacy (DP), referred to as DP-FedC, which also accounts for partial client participation and multiple local model update steps. Furthermore, we characterize several properties of DP-FedC through theoretical analyses of its privacy protection and convergence rate, especially for non-independent and identically distributed (non-i.i.d.) data; these properties serve as guidelines for the design of DP-FedC. Finally, experimental results on two real datasets demonstrate the efficacy of DP-FedC, its substantially better performance than several state-of-the-art FedC algorithms, and its consistency with all the presented analytical results.
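The ingredients named in the abstract (soft cluster membership, multiple local update steps, partial client participation, and a DP perturbation) can be illustrated with a minimal sketch of one communication round. This is not the DP-FedC algorithm itself, which the abstract does not specify. It assumes a softmax-based soft assignment as a stand-in for the paper's reformulation, and the common clip-then-add-Gaussian-noise mechanism for DP. All function names and parameters (`soft_assign`, `local_update`, `dp_fed_round`, `frac`, `sigma`, `max_norm`, `temp`) are hypothetical, and `sigma` is not calibrated to any particular privacy budget.

```python
import numpy as np

def soft_assign(X, C, temp=1.0):
    """Soft membership weights (n, k): softmax over negative squared distances.
    Assumption: a stand-in for the paper's soft clustering reformulation."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def local_update(X, C, steps, lr):
    """Run several local gradient steps on a client's data, starting from C."""
    C = C.copy()
    for _ in range(steps):
        W = soft_assign(X, C)
        # Gradient of (1/2n) * sum_i sum_k W[i,k] * ||x_i - c_k||^2, W held fixed
        grad = (W.sum(axis=0)[:, None] * C - W.T @ X) / len(X)
        C -= lr * grad
    return C

def clip(delta, max_norm):
    """Scale delta so its Frobenius norm is at most max_norm."""
    norm = np.linalg.norm(delta)
    return delta * min(1.0, max_norm / (norm + 1e-12))

def dp_fed_round(client_data, C, rng, frac=0.5, steps=5, lr=0.5,
                 max_norm=1.0, sigma=0.1):
    """One round: sample clients, update locally, clip, add noise, average."""
    m = max(1, int(round(frac * len(client_data))))
    chosen = rng.choice(len(client_data), size=m, replace=False)
    updates = []
    for i in chosen:
        delta = clip(local_update(client_data[i], C, steps, lr) - C, max_norm)
        # Gaussian perturbation applied on the client before upload (assumption)
        noisy = delta + rng.normal(0.0, sigma * max_norm, size=delta.shape)
        updates.append(noisy)
    return C + np.mean(updates, axis=0)
```

In this sketch the server only ever sees clipped, noise-perturbed centroid updates, so the noise scale trades clustering accuracy against privacy. How that trade-off is set in DP-FedC is the subject of the paper's privacy and convergence analyses, not of this illustration.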