Differentially private synthetic data provide a powerful mechanism for enabling data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm that efficiently generates low-dimensional synthetic data from a high-dimensional dataset, with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA does not assume a spectral gap in the covariance matrix.
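To make the idea of a private PCA step concrete, the following is a minimal sketch of one generic approach: perturb the empirical second-moment matrix with symmetric Gaussian noise and take the top eigenvectors of the noisy matrix. It is not necessarily the procedure analyzed in this paper. The function name `private_pca_sketch`, the row clipping to unit Euclidean norm, the use of the uncentered second-moment matrix, the Gaussian mechanism with its (epsilon, delta) guarantee for epsilon below 1, and the conservative sensitivity bound 2/n under replace-one neighboring datasets are all assumptions made for illustration.

```python
import numpy as np


def private_pca_sketch(X, k, epsilon, delta, rng=None):
    """Hypothetical sketch: top-k subspace of a noise-perturbed second-moment matrix.

    Assumptions (not taken from the paper): rows are clipped to unit norm,
    neighboring datasets differ in one row, and the classical Gaussian
    mechanism (valid for 0 < epsilon < 1) supplies the privacy guarantee.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Clip every row to Euclidean norm at most 1 so the sensitivity is bounded.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1.0)

    # Uncentered empirical second-moment matrix.
    M = X.T @ X / n

    # Replacing one row changes M by at most 2/n in Frobenius norm
    # (a conservative bound).
    sensitivity = 2.0 / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Symmetric noise: draw the upper triangle, then mirror it.
    upper = np.triu(rng.normal(0.0, sigma, size=(d, d)))
    noise = upper + np.triu(upper, 1).T

    # eigh returns eigenvalues in ascending order; reverse to get the top k.
    _, eigvecs = np.linalg.eigh(M + noise)
    return eigvecs[:, ::-1][:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    V = private_pca_sketch(X, k=3, epsilon=0.5, delta=1e-5, rng=rng)
    print(V.shape)
```

The returned matrix has orthonormal columns spanning the estimated subspace. Note that projecting the raw data onto this subspace is not private by itself; in a full pipeline the low-dimensional synthetic data would still have to be generated by a separate private mechanism.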