Generating differentially private (DP) synthetic data that closely resembles the original private data without leaking sensitive user information is a scalable way to mitigate privacy concerns in today's data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as black boxes and use only their inference APIs. Such API-based, training-free approaches are easier to deploy, as exemplified by the recent surge in the number of API-based apps. They can also leverage the power of large foundation models that are accessible through inference APIs even though their weights are unreleased. However, this setting is more challenging, because model access is strictly more restrictive and privacy must additionally be protected from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID ≤ 7.9 at a privacy cost of ε = 0.67, a significant improvement over the previous SOTA, which required ε = 32. We further demonstrate the promise of applying PE to large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images.
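The abstract does not spell out how PE works internally, so the following is only a toy sketch of what an API-only, training-free DP generation loop could look like, not the authors' implementation. Everything in it is an assumption made for illustration: `random_api` and `variation_api` are hypothetical stand-ins for a foundation model's inference endpoints (here they just produce 2-D points), and the selection step is a nearest-neighbor vote histogram with Gaussian noise. Privacy accounting across iterations is deliberately left out, so the sketch does not correspond to any particular ε.

```python
import math
import random


def random_api(n, rng):
    """Hypothetical stand-in for a generation API: returns n random 2-D points."""
    return [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n)]


def variation_api(sample, rng, scale=0.5):
    """Hypothetical stand-in for a variation API: returns a perturbed copy."""
    return (sample[0] + rng.gauss(0, scale), sample[1] + rng.gauss(0, scale))


def noisy_vote_histogram(private, candidates, sigma, rng):
    """Each private point votes for its nearest candidate; noise is added to counts.

    Adding or removing one private point changes exactly one count by 1,
    so Gaussian noise on the counts is an instance of the Gaussian mechanism.
    Clipping at zero is post-processing and does not touch the private data.
    """
    counts = [0.0] * len(candidates)
    for p in private:
        nearest = min(
            range(len(candidates)),
            key=lambda i: math.dist(p, candidates[i]),
        )
        counts[nearest] += 1.0
    return [max(c + rng.gauss(0, sigma), 0.0) for c in counts]


def evolve(private, n_synth=50, iterations=5, sigma=1.0, seed=0):
    """Toy loop: generate, privately score, resample, ask for variations."""
    rng = random.Random(seed)
    synth = random_api(n_synth, rng)  # the only model access is via the "API"
    for _ in range(iterations):
        weights = noisy_vote_histogram(private, synth, sigma, rng)
        if sum(weights) == 0:
            weights = [1.0] * len(synth)  # fall back to uniform resampling
        # Only the noisy histogram, never the private data, drives selection.
        parents = rng.choices(synth, weights=weights, k=n_synth)
        synth = [variation_api(p, rng) for p in parents]
    return synth


if __name__ == "__main__":
    data_rng = random.Random(1)
    private = [(3 + data_rng.gauss(0, 0.3), 3 + data_rng.gauss(0, 0.3))
               for _ in range(200)]
    synthetic = evolve(private)
    print(len(synthetic), "synthetic points generated")
```

In this sketch the private data is touched only inside `noisy_vote_histogram`, and the mock API functions never receive private samples, which mirrors the abstract's requirement of protecting privacy from the API provider as well.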