Differentially Private Synthetic Data via Foundation Model APIs 1: Images

Generating differentially private (DP) synthetic data that closely resembles the original private data without leaking sensitive user information is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as black boxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy, as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models that are accessible via their inference APIs while the model weights are unreleased. However, this comes with greater challenges due to strictly more restrictive model access and the additional need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID <= 7.9 with privacy cost epsilon = 0.67, significantly improving on the previous SOTA, which required epsilon = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images.
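The abstract does not spell out the steps of PE, so the following is only a toy sketch of the general idea it describes: improving a pool of samples using nothing but black-box generation calls, guided by a noised vote from the private data. Everything here is an assumption made for illustration: the names `random_api` and `variation_api`, the use of 1-D floats in place of images, absolute difference in place of an embedding distance, and the choice of Gaussian noise on a nearest-neighbor vote histogram. No privacy accounting is performed, so the sketch makes no claim about any particular epsilon.

```python
import random


def noisy_vote_histogram(private, candidates, sigma, rng):
    """Each private point votes for its nearest candidate, then Gaussian
    noise is added to every count. One private point changes exactly one
    count by 1, which is what the noise is meant to mask."""
    counts = [0.0] * len(candidates)
    for p in private:
        nearest = min(range(len(candidates)), key=lambda i: abs(candidates[i] - p))
        counts[nearest] += 1.0
    return [c + rng.gauss(0.0, sigma) for c in counts]


def evolve(private, random_api, variation_api, n_samples, n_iters, sigma, rng):
    """Toy evolution loop. The two API callables (hypothetical stand-ins for
    a foundation model's inference endpoints) only ever see synthetic
    samples, never the private data."""
    synthetic = [random_api() for _ in range(n_samples)]
    for _ in range(n_iters):
        hist = noisy_vote_histogram(private, synthetic, sigma, rng)
        weights = [max(h, 0.0) for h in hist]  # clip negative noisy counts
        if sum(weights) == 0.0:
            weights = [1.0] * len(synthetic)  # fall back to uniform resampling
        parents = rng.choices(synthetic, weights=weights, k=n_samples)
        synthetic = [variation_api(x) for x in parents]
    return synthetic


def demo(seed=0):
    rng = random.Random(seed)
    # Stand-in "private data": 200 points clustered around 3.0.
    private = [rng.gauss(3.0, 0.3) for _ in range(200)]
    return evolve(
        private,
        random_api=lambda: rng.uniform(-10.0, 10.0),      # hypothetical API
        variation_api=lambda x: x + rng.gauss(0.0, 0.5),  # hypothetical API
        n_samples=50,
        n_iters=10,
        sigma=1.0,
        rng=rng,
    )


samples = demo()
print("synthetic mean:", sum(samples) / len(samples))
```

In this sketch the private data is touched only inside `noisy_vote_histogram`, and only the noised counts influence which samples are sent back to the API for variation. That separation is the design point the abstract hints at when it mentions protecting privacy from the API provider.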

