As deep generative models have progressed, recent work has shown them to be capable of memorizing and reproducing training datapoints when deployed. These findings call into question the usability of generative models, especially in light of the legal and privacy risks brought about by memorization. To better understand this phenomenon, we propose the manifold memorization hypothesis (MMH), a geometric framework that casts the manifold hypothesis into a clear language in which to reason about memorization. Specifically, we analyze memorization in terms of the relationship between the dimensionalities of (i) the ground-truth data manifold and (ii) the manifold learned by the model. This framework provides a formal standard for "how memorized" a datapoint is and systematically categorizes memorized data into two types: memorization driven by overfitting and memorization driven by the underlying data distribution. By analyzing prior work in the context of the MMH, we explain and unify assorted observations in the literature. We empirically validate the MMH on synthetic data and on image datasets up to the scale of Stable Diffusion, developing new tools for detecting and preventing the generation of memorized samples in the process.
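To make the dimension comparison concrete, the following is a minimal sketch of the idea, not the paper's implementation: it estimates the local intrinsic dimension (LID) around a point by running PCA on the point's nearest neighbours, and flags a training point as a memorization candidate when the model's LID there falls below the data's. The k-NN/PCA estimator, the variance threshold, and the names local_intrinsic_dimension and memorization_gap are illustrative assumptions; any LID estimator could be substituted.

import numpy as np


def local_intrinsic_dimension(x, samples, k=50, var_threshold=0.95):
    # Collect the k nearest neighbours of x under Euclidean distance.
    dists = np.linalg.norm(samples - x, axis=1)
    neighbours = samples[np.argsort(dists)[:k]]
    centred = neighbours - neighbours.mean(axis=0)
    # Singular values of the centred neighbourhood measure local variance
    # per principal direction.
    svals = np.linalg.svd(centred, compute_uv=False)
    var = svals ** 2
    if var.sum() == 0.0:  # all neighbours coincide: a zero-dimensional point mass
        return 0
    cum = np.cumsum(var) / var.sum()
    # LID = number of components needed to explain var_threshold of the variance.
    return int(np.searchsorted(cum, var_threshold) + 1)


def memorization_gap(x, data_samples, model_samples, k=50):
    # A positive gap means the model's manifold around x is lower-dimensional
    # than the data manifold there, making x a candidate memorized point.
    return (local_intrinsic_dimension(x, data_samples, k)
            - local_intrinsic_dimension(x, model_samples, k))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Ground-truth data: a 2-D Gaussian plane embedded in 5-D ambient space.
    data = np.zeros((2000, 5))
    data[:, :2] = rng.normal(size=(2000, 2))
    x = data[0]
    # A "memorizing" model: its samples collapse onto a 1-D line through the
    # training point x, so the learned manifold is lower-dimensional than
    # the true 2-D plane.
    t = rng.normal(size=(2000, 1))
    direction = np.zeros(5)
    direction[0] = 1.0
    model = x + t * direction
    print(memorization_gap(x, data, model))  # expect 2 - 1 = 1

In this toy example the dimension gap at x is positive, which is the signature the MMH associates with memorization; at scale, one would replace the k-NN/PCA estimator with an LID estimator suited to the generative model at hand.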