In this paper, we investigate the federated clustering (FedC) problem, which aims to accurately partition unlabeled data samples distributed over a massive number of clients into a finite number of clusters under the orchestration of a parameter server while preserving data privacy. Although FedC is an NP-hard optimization problem involving real variables that denote the cluster centroids and binary variables that denote the cluster membership of each data sample, we reformulate it as a non-convex optimization problem with only one convex constraint, which yields a soft clustering solution. We then propose a novel FedC algorithm based on differential privacy (DP), referred to as DP-FedC, which also accounts for partial client participation and multiple local model update steps. Furthermore, we characterize the proposed DP-FedC through theoretical analyses of its privacy protection and convergence rate, especially for non-independent and identically distributed (non-i.i.d.) data; these results serve as guidelines for the design of DP-FedC. Finally, experimental results on two real datasets demonstrate the efficacy of DP-FedC, its much superior performance over some state-of-the-art FedC algorithms, and its consistency with all of the presented analytical results.
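The ingredients named in the abstract (soft cluster assignment, multiple local update steps, partial client participation, and noise added for privacy) can be illustrated with a minimal sketch. This is not the authors' DP-FedC algorithm: the softmax-based soft assignment, the clipping rule, the Gaussian noise scale, and every function and parameter name below (`soft_assign`, `local_update`, `privatize`, `dp_fed_round`, `frac`, `clip`, `sigma`) are assumptions chosen for illustration, and no formal privacy guarantee is claimed for these settings.

```python
import math
import random


def soft_assign(x, centroids, temp=1.0):
    """Soft membership (assumed form): softmax over negative squared distances."""
    d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids]
    m = min(d2)  # subtract the minimum for numerical stability
    w = [math.exp(-(d - m) / temp) for d in d2]
    s = sum(w)
    return [wi / s for wi in w]  # non-negative weights that sum to 1


def local_update(data, centroids, steps, lr):
    """Run several local gradient steps and return the centroid change.

    Memberships are recomputed each step and held fixed while the
    centroids move (an EM-like simplification, assumed here).
    """
    cents = [list(c) for c in centroids]
    for _ in range(steps):
        grads = [[0.0] * len(c) for c in cents]
        for x in data:
            w = soft_assign(x, cents)
            for k, c in enumerate(cents):
                for j in range(len(c)):
                    grads[k][j] += w[k] * (c[j] - x[j]) / len(data)
        for k in range(len(cents)):
            for j in range(len(cents[k])):
                cents[k][j] -= lr * grads[k][j]
    return [[n - o for n, o in zip(new, old)]
            for new, old in zip(cents, centroids)]


def privatize(delta, clip, sigma, rng):
    """Clip the update to L2 norm `clip`, then add Gaussian noise."""
    norm = math.sqrt(sum(v * v for row in delta for v in row))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    return [[v * scale + rng.gauss(0.0, sigma * clip) for v in row]
            for row in delta]


def dp_fed_round(clients, centroids, rng, frac=0.5, steps=5,
                 lr=0.5, clip=1.0, sigma=0.01):
    """One server round: sample a subset of clients, average noisy updates."""
    m = max(1, int(frac * len(clients)))
    chosen = rng.sample(clients, m)  # partial client participation
    agg = [[0.0] * len(c) for c in centroids]
    for data in chosen:
        delta = local_update(data, centroids, steps, lr)
        noisy = privatize(delta, clip, sigma, rng)
        for k, row in enumerate(noisy):
            for j, v in enumerate(row):
                agg[k][j] += v / m
    return [[c + a for c, a in zip(crow, arow)]
            for crow, arow in zip(centroids, agg)]
```

Calling `dp_fed_round` repeatedly with the returned centroids gives a simple training loop. In this sketch only clipped and noised centroid updates leave a client; the raw samples stay local.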