Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models such as those used in classification, little research has examined their application to generative models. Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. In this work we aim to shed light on the accuracy of this assumption, and specifically to answer the question of whether data pruning can have a positive impact on generative diffusion models. Contrary to intuition, we show that eliminating redundant or noisy data in large datasets is beneficial, particularly when done strategically. We experiment with several pruning methods, including recent state-of-the-art approaches, and evaluate on the CelebA-HQ and ImageNet datasets. We demonstrate that a simple clustering method outperforms other, more sophisticated and computationally demanding methods. We further show how clustering can be leveraged to balance skewed datasets in an unsupervised manner, allowing fair sampling of underrepresented populations in the data distribution, which is a crucial problem for generative models.
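The abstract does not detail the clustering procedure, so the sketch below illustrates one plausible instantiation: k-means over precomputed image embeddings, a nearest-to-centroid keep criterion for pruning, and per-cluster quotas for unsupervised balancing of skewed data. The embedding source and the parameters `n_clusters`, `keep_ratio`, and `n_per_cluster` are illustrative assumptions, not the paper's stated method.

```python
# Minimal sketch of clustering-based data pruning and cluster-balanced
# sampling. The keep criterion (nearest to centroid) and all parameter
# values are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(embeddings: np.ndarray, n_clusters: int = 100,
                  keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of a pruned subset: within each cluster, keep the
    points closest to the centroid and discard the rest."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    labels = km.fit_predict(embeddings)
    # Distance of every point to its own cluster centroid.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        k = max(1, int(len(idx) * keep_ratio))
        keep.extend(idx[np.argsort(dists[idx])[:k]])  # nearest-to-centroid
    return np.array(keep)

def balanced_sample(labels: np.ndarray, n_per_cluster: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Draw the same number of examples from every cluster, so that
    underrepresented modes are not swamped by dominant ones."""
    picks = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        picks.extend(rng.choice(idx, size=min(n_per_cluster, len(idx)),
                                replace=False))
    return np.array(picks)
```

In this sketch, pruning and balancing share the same cluster assignment: pruning thins each cluster of its most redundant members, while balanced sampling equalizes cluster frequencies at training time without requiring any group labels.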