Differentially private synthetic data provide a powerful mechanism for enabling data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm that efficiently generates low-dimensional synthetic data from a high-dimensional dataset, with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis based on the Davis-Kahan theorem, our analysis of private PCA does not require a spectral gap assumption on the sample covariance matrix.
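To make the private PCA step concrete, the following is a minimal sketch of one textbook way to privatize a principal subspace: add symmetric Gaussian noise to the second-moment matrix and take its top eigenvectors. It is not the algorithm or the analysis proposed in this paper. The function name `private_pca_subspace`, the clipping of rows to unit norm, the omission of mean-centering, the replace-one neighboring relation, and the classical Gaussian-mechanism calibration (valid for 0 < epsilon < 1) are all assumptions made for illustration.

```python
import numpy as np


def private_pca_subspace(X, k, epsilon, delta, rng=None):
    """Hypothetical helper: (epsilon, delta)-DP estimate of a top-k subspace.

    Illustrative only: Gaussian noise added to the (uncentered) second-moment
    matrix. This is an assumed stand-in, not the paper's procedure.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Clip every row to L2 norm at most 1 so the sensitivity is bounded.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X / np.maximum(norms, 1.0)

    # Uncentered second-moment matrix (mean-centering omitted for simplicity).
    cov = Xc.T @ Xc / n

    # Replacing one clipped row changes cov by at most 2/n in Frobenius norm.
    sensitivity = 2.0 / n

    # Classical Gaussian mechanism calibration; assumes 0 < epsilon < 1.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Symmetric noise: draw the upper triangle i.i.d. and mirror it.
    noise = rng.normal(0.0, sigma, size=(d, d))
    sym_noise = np.triu(noise) + np.triu(noise, 1).T
    noisy_cov = cov + sym_noise

    # eigh returns eigenvalues in ascending order; keep the k largest,
    # reordered so the leading direction comes first.
    _, eigvecs = np.linalg.eigh(noisy_cov)
    return eigvecs[:, -k:][:, ::-1]
```

The returned matrix has orthonormal columns spanning the estimated subspace. Projecting the original rows onto it (for example `X @ V`) does not by itself yield private data; a separate differentially private synthetic-data step in the low-dimensional space would still be needed, with the privacy budget split between the two stages.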