The rapid growth of modern training datasets has sharply increased computational cost, motivating dataset pruning~(DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely either on intrinsic signals, which assess each sample independently, or on extrinsic signals, which promote diversity through pairwise relations. While effective in their own regimes, each captures only one aspect of sample utility and lacks robustness across pruning ratios and data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, in which node weights encode intrinsic value and edge weights encode extrinsic value, we cast DP as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost: on ImageNet-1k with ResNet-50, it cuts training time by over 40\% without sacrificing accuracy.
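To make the greedy idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the objective is the sum of node weights of the selected samples plus the sum of edge weights between every selected pair, that the edge-weight matrix is symmetric, and that a fixed budget `k` of samples is kept. The names `greedy_prune`, `node_w`, `edge_w`, and `k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def greedy_prune(node_w, edge_w, k):
    """Greedily select k samples by marginal gain (illustrative sketch).

    Assumed objective: sum of node weights over the selected set plus the
    sum of edge weights over all selected pairs. edge_w is assumed symmetric.
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    n = node_w.shape[0]
    if edge_w.shape != (n, n):
        raise ValueError("edge_w must be an (n, n) matrix")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")

    # Marginal gain of each candidate with respect to the empty set.
    gains = node_w.copy()
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        # Exclude already selected samples, then take the best candidate.
        v = int(np.argmax(np.where(chosen, -np.inf, gains)))
        selected.append(v)
        chosen[v] = True
        # Selecting v raises each remaining candidate's gain by its edge to v.
        gains += edge_w[v]
    return selected


# Toy example: samples 0 and 1 have high intrinsic value but are redundant
# (edge weight 0), while sample 2 is weak on its own but diverse.
node_w = [1.0, 0.9, 0.2]
edge_w = [[0.0, 0.0, 1.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 0.0]]
print(greedy_prune(node_w, edge_w, k=2))  # → [0, 2]
```

In this toy case a purely intrinsic criterion would keep samples 0 and 1, whereas the combined marginal gain prefers sample 2 as the second pick because of its edge weight to sample 0. Updating the gains incrementally keeps each step at O(n) work, for O(nk) overall once the edge weights are available.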