Maximum mean discrepancy (MMD) is a particularly useful distance metric for differentially private data generation: when used with finite-dimensional features, it allows us to summarize and privatize the data distribution once, and then reuse that privatized summary repeatedly during generator training without further privacy loss. An important question in this framework is therefore which features are useful for distinguishing between real and synthetic data distributions, and whether they enable us to generate high-quality synthetic data. This work considers using the features of $\textit{neural tangent kernels (NTKs)}$, more precisely $\textit{empirical}$ NTKs (e-NTKs). We find that, perhaps surprisingly, the expressiveness of the untrained e-NTK features is comparable to that of perceptual features obtained by pre-training on public data. As a result, our method improves the privacy-accuracy trade-off compared to other state-of-the-art methods, without relying on any public data, as demonstrated on several tabular and image benchmark datasets.
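To make the "privatize once, reuse many times" idea concrete, the following is a minimal NumPy sketch and not the implementation used in this work. With a finite-dimensional feature map $\phi$, the squared MMD between real data and synthetic data reduces to the squared distance between their mean embeddings, so only the real-data mean embedding needs to be released with noise. Everything specific here is an assumption made for illustration: the tiny two-layer ReLU network, the helper names `entk_features`, `privatize_mean_embedding` and `mmd2`, the unit-norm feature normalization, the replace-one sensitivity bound of $2/m$, and the noise multiplier `sigma`, which is taken as given rather than calibrated to a particular $(\epsilon, \delta)$.

```python
import numpy as np


def entk_features(x, w1, w2):
    """e-NTK-style features for a scalar network f(x) = w2 . relu(W1 x).

    Each row is the gradient of f(x) with respect to all parameters at the
    given (untrained) weights, flattened and scaled to at most unit L2 norm.
    """
    pre = x @ w1.T                      # pre-activations, shape (n, h)
    act = np.maximum(pre, 0.0)          # df/dw2 = relu(W1 x), shape (n, h)
    mask = (pre > 0).astype(x.dtype)    # ReLU derivative, shape (n, h)
    # df/dW1[j, :] = w2[j] * 1[pre_j > 0] * x, shape (n, h, d)
    g_w1 = (mask * w2)[:, :, None] * x[:, None, :]
    feats = np.concatenate([g_w1.reshape(len(x), -1), act], axis=1)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)


def privatize_mean_embedding(feats, sigma, rng):
    """Release the mean embedding once, with Gaussian noise added.

    Assumes every feature vector has norm <= 1, so replacing one of the m
    records moves the mean by at most 2 / m in L2 norm.
    """
    m = feats.shape[0]
    sensitivity = 2.0 / m
    mu = feats.mean(axis=0)
    return mu + rng.normal(0.0, sigma * sensitivity, size=mu.shape)


def mmd2(mu_private, synthetic_feats):
    """Squared MMD estimate between the released embedding and synthetic data."""
    diff = mu_private - synthetic_feats.mean(axis=0)
    return float(diff @ diff)


rng = np.random.default_rng(0)
d, h = 5, 16
w1 = rng.normal(size=(h, d)) / np.sqrt(d)   # untrained, randomly initialized
w2 = rng.normal(size=h) / np.sqrt(h)

real = rng.normal(loc=1.0, size=(2000, d))  # stand-in for the private data
mu_private = privatize_mean_embedding(entk_features(real, w1, w2), 1.0, rng)

# From here on the private data is never touched again: any number of
# candidate synthetic batches can be scored against mu_private.
for name, loc in [("similar", 1.0), ("different", -1.0)]:
    synthetic = rng.normal(loc=loc, size=(500, d))
    print(name, mmd2(mu_private, entk_features(synthetic, w1, w2)))
```

In a generator training loop, `mmd2` would serve as the loss evaluated on each batch of generated samples, with `mu_private` held fixed; since the noisy embedding is computed once, repeated evaluations are post-processing and incur no further privacy cost.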