In many applications, multiple parties hold private data about the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only the information needed to train the model rather than the private data itself. However, it is challenging to ensure that the shared information preserves privacy while still allowing accurate models to be learned. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. Based on the received information, the server builds a weighted grid as a synopsis of the global dataset. The final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, lightweight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weight estimation algorithm and a parameter tuning strategy that bring the final k-means utility close to that in the central private setting. We provide a theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm, and show that our approach performs better, both theoretically and empirically, than two baselines based on existing techniques.
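To make the sketch-based estimation step more concrete, the following Python fragment illustrates the general idea of estimating a set intersection cardinality from Flajolet-Martin sketches. It is a minimal, non-private sketch under stated assumptions and is not the algorithm proposed in this paper: it adds no differential privacy noise, it uses plain inclusion-exclusion for two parties, and the names `fm_sketch`, `merge`, `estimate`, `estimate_intersection`, and the parameter `NUM_HASHES` are hypothetical.

```python
import hashlib

PHI = 0.77351      # Flajolet-Martin correction constant
NUM_HASHES = 64    # assumed number of independent bitmaps (hypothetical choice)
HASH_BITS = 64


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of `item`, salted with `seed`."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


def _lowest_set_bit(h: int) -> int:
    """Index of the least-significant 1-bit of h (HASH_BITS - 1 if h == 0)."""
    if h == 0:
        return HASH_BITS - 1
    return (h & -h).bit_length() - 1


def fm_sketch(items) -> list[int]:
    """Build NUM_HASHES Flajolet-Martin bitmaps, each stored as an int."""
    bitmaps = [0] * NUM_HASHES
    for item in items:
        for seed in range(NUM_HASHES):
            bitmaps[seed] |= 1 << _lowest_set_bit(_hash(item, seed))
    return bitmaps


def merge(sketch_a: list[int], sketch_b: list[int]) -> list[int]:
    """The sketch of a union is the bitwise OR of the two sketches."""
    return [a | b for a, b in zip(sketch_a, sketch_b)]


def estimate(sketch: list[int]) -> float:
    """Estimate the number of distinct items as 2^(mean R) / PHI,
    where R is the index of the lowest zero bit of each bitmap."""
    total = 0
    for bitmap in sketch:
        r = 0
        while (bitmap >> r) & 1:
            r += 1
        total += r
    return 2 ** (total / len(sketch)) / PHI


def estimate_intersection(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B| (clamped at 0)."""
    union = estimate(merge(sketch_a, sketch_b))
    return max(0.0, estimate(sketch_a) + estimate(sketch_b) - union)
```

In a vertical federated setting, one could imagine each party building such a sketch over the user identifiers assigned to one of its local clusters, so that the server only sees bitmaps rather than identifiers. The estimate is approximate, and inclusion-exclusion amplifies the error of the individual estimates, which is one reason a more careful treatment is needed when more than two parties are involved.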