As diffusion probabilistic models (DPMs) become mainstream models for generative artificial intelligence (AI), the study of their memorization of raw training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether, or to what extent, DPMs learn by memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for more controllable generation and trustworthy application of Artificial Intelligence Generated Content (AIGC). While previous works have made important observations about when DPMs are prone to memorization, these findings are mostly empirical, and the data extraction methods they developed only work for conditional diffusion models. In this work, we aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization. Based on the theoretical analysis, we further propose a novel data extraction method called \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models. Our empirical results demonstrate that SIDE can extract training data from diffusion models where previous methods fail, and that it is on average over 50\% more effective across different scales of the CelebA dataset.