Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if the dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data, referred to as the coreset, that maximizes the performance of models trained on it. There are two dominant approaches: (1) geometry-based data selection, which maximizes data diversity in the coreset, and (2) scoring functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas selection by difficulty ranking omits easy samples that are necessary for training deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be considered jointly during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty score of each example by incorporating the difficulty of its neighboring examples in the dataset graph. These updated difficulty scores then guide a graph-based sampling method that selects a coreset covering both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%. Additionally, we find that using D2 Pruning to filter large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models.
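The two message-passing stages described above can be illustrated with a minimal sketch. The abstract does not state the exact update rules, so everything specific here is an assumption made for illustration: the symmetrized k-nearest-neighbor graph, the Gaussian-style edge weights, the parameters `k`, `gamma_f`, and `gamma_r`, and the function names `build_graph` and `d2_style_select`. It should be read as one plausible instance of the idea, not as the authors' implementation.

```python
import numpy as np


def build_graph(features, k):
    """Return squared distances and a symmetric (undirected) kNN adjacency mask."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)  # a sample is never its own neighbor
    order = np.argsort(masked, axis=1)[:, :k]
    n = len(features)
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.arange(n)[:, None], order] = True
    adjacency = adjacency | adjacency.T  # symmetrize so the graph is undirected
    return dist, adjacency


def d2_style_select(features, scores, budget, k=2, gamma_f=1.0, gamma_r=1.0):
    """Assumed sketch: forward pass smooths scores, reverse pass enforces diversity."""
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    dist, adjacency = build_graph(features, k)

    # Forward message passing (assumed form): each sample adds the
    # distance-weighted difficulty of its graph neighbors to its own score.
    w_forward = np.exp(-gamma_f * dist) * adjacency
    updated = scores + w_forward @ scores

    # Reverse message passing (assumed form): repeatedly pick the highest-scoring
    # sample, then lower its neighbors' scores so the next pick tends to come
    # from a different region of the graph.
    w_reverse = np.exp(-gamma_r * dist) * adjacency
    work = updated.copy()
    available = np.ones(len(scores), dtype=bool)
    selected = []
    for _ in range(budget):
        i = int(np.argmax(np.where(available, work, -np.inf)))
        selected.append(i)
        available[i] = False
        work -= w_reverse[i] * work[i]
    return selected, updated


# Toy data: a "difficult" cluster near the origin and an "easy" cluster near (5, 5).
toy_features = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
toy_scores = np.array([0.9, 0.8, 0.85, 0.2, 0.1, 0.15])
selected, updated = d2_style_select(toy_features, toy_scores, budget=2)
print(selected)
```

On this toy input, ranking by raw difficulty alone would take both samples from the difficult cluster, whereas the sketch picks one sample from each cluster: the reverse pass suppresses the neighbors of the first pick, so the easy cluster still gets represented.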