Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features, it allows us to summarize and privatize the data distribution once, and that privatized summary can then be reused throughout generator training without further privacy loss. An important question in this framework is, then, which features are useful for distinguishing between the real and synthetic data distributions, and whether those features enable us to generate high-quality synthetic data. This work considers using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of perceptual features taken from networks pre-trained on public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.
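To make the "summarize and privatize once" idea concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes a two-layer ReLU network at random initialization, takes the e-NTK feature of an input to be the L2-normalized gradient of the scalar network output with respect to all parameters, and privatizes the mean feature embedding with the classical Gaussian mechanism (valid for $\epsilon < 1$, replace-one neighboring datasets). All function names and parameter values are hypothetical and chosen for illustration only.

```python
import math
import random


def init_network(d, m, rng):
    """Random, untrained weights of f(x) = sum_j v[j] * relu(W[j] . x) / sqrt(m)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return W, v


def entk_features(x, W, v):
    """Gradient of f(x) w.r.t. all parameters (v and W), normalized to unit L2 norm."""
    m = len(v)
    scale = 1.0 / math.sqrt(m)
    feats = []
    for j in range(m):
        pre = sum(w * xi for w, xi in zip(W[j], x))
        gate = 1.0 if pre > 0 else 0.0
        feats.append(scale * max(pre, 0.0))                 # df/dv[j]
        feats.extend(scale * v[j] * gate * xi for xi in x)  # df/dW[j][i]
    norm = math.sqrt(sum(f * f for f in feats))
    # Normalizing bounds every feature vector by 1, which bounds the sensitivity below.
    return [f / norm for f in feats] if norm > 0 else feats


def mean_embedding(data, W, v):
    """Average e-NTK feature vector over a dataset."""
    feats = [entk_features(x, W, v) for x in data]
    return [sum(col) / len(feats) for col in zip(*feats)]


def privatize(mu, n, eps, delta, rng):
    """Gaussian mechanism on the mean embedding (assumed classical calibration, eps < 1)."""
    sensitivity = 2.0 / n  # replacing one record moves the mean by at most 2/n in L2
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return [m_k + rng.gauss(0.0, sigma) for m_k in mu]


def mmd2(mu_a, mu_b):
    """Squared MMD under the finite-dimensional feature map: ||mu_a - mu_b||^2."""
    return sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))


def sample_gaussian(n, mean, rng, std=0.5):
    """Toy data: isotropic Gaussian around `mean`."""
    return [[rng.gauss(mu_i, std) for mu_i in mean] for _ in range(n)]


rng = random.Random(0)
W, v = init_network(d=2, m=16, rng=rng)

real = sample_gaussian(2000, (1.0, 1.0), rng)
# The private data is touched exactly once, here.
mu_private = privatize(mean_embedding(real, W, v), n=len(real), eps=0.5, delta=1e-5, rng=rng)

# The privatized embedding can now be compared against any number of synthetic batches.
synthetic_close = sample_gaussian(2000, (1.0, 1.0), rng)
synthetic_far = sample_gaussian(2000, (-1.0, -1.0), rng)
print("MMD^2 to matching synthetic data:   ", mmd2(mu_private, mean_embedding(synthetic_close, W, v)))
print("MMD^2 to mismatched synthetic data: ", mmd2(mu_private, mean_embedding(synthetic_far, W, v)))
```

In this sketch the noisy vector `mu_private` is the only quantity derived from the real data, so a generator could be trained by minimizing `mmd2(mu_private, ...)` for as many steps as needed; by the post-processing property of differential privacy, those repeated evaluations incur no additional privacy cost.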