Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models such as those used in classification, little research has examined their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this assumption, and specifically to answer the question of whether data pruning can have a positive impact on generative diffusion models. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial, particularly when done strategically. We experiment with several pruning methods, including recent state-of-the-art approaches, and evaluate on the CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other, more sophisticated and computationally demanding methods. We further show how clustering can be leveraged to balance skewed datasets in an unsupervised manner, allowing fair sampling of underrepresented populations in the data distribution, which is a crucial problem for generative models.
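The abstract does not detail the clustering procedure, so the sketch below illustrates one plausible instantiation: k-means over precomputed image embeddings, a nearest-to-centroid keep criterion for pruning, and per-cluster quotas for unsupervised balancing of skewed data. The embedding source and the parameters `n_clusters`, `keep_ratio`, and `n_per_cluster` are illustrative assumptions, not the paper's stated method.

```python
# Minimal sketch of clustering-based data pruning and cluster-balanced
# sampling. The keep criterion (nearest to centroid) and all parameter
# values are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(embeddings: np.ndarray, n_clusters: int = 100,
                  keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of a pruned subset: within each cluster, keep the
    points closest to the centroid and discard the rest."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    labels = km.fit_predict(embeddings)
    # Distance of every point to its own cluster centroid.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        k = max(1, int(len(idx) * keep_ratio))
        keep.extend(idx[np.argsort(dists[idx])[:k]])  # nearest-to-centroid
    return np.array(keep)

def balanced_sample(labels: np.ndarray, n_per_cluster: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Draw the same number of examples from every cluster, so that
    underrepresented modes are not swamped by dominant ones."""
    picks = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        picks.extend(rng.choice(idx, size=min(n_per_cluster, len(idx)),
                                replace=False))
    return np.array(picks)
```

In this sketch, pruning and balancing share the same cluster assignment: pruning thins each cluster of its most redundant members, while balanced sampling equalizes cluster frequencies at training time without requiring any group labels.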