Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals, which assess each sample independently, or extrinsic signals, which promote diversity via pairwise relations. While effective in their own regimes, each captures only one aspect of sample utility and lacks robustness across pruning ratios and data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost: on ImageNet-1k with ResNet-50, it cuts training time by over 40% without sacrificing accuracy.
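To make the greedy idea concrete, the following is a minimal sketch of marginal-gain selection on a weighted graph. It is not the paper's exact algorithm: the abstract does not state the objective, so the sketch assumes the additive form F(S) = sum of node weights in S + lam * sum of edge weights between pairs in S. The function name `greedy_prune`, the trade-off parameter `lam`, and the toy weights are all hypothetical, and the edge-weight matrix is assumed to be symmetric with a zero diagonal.

```python
import numpy as np


def greedy_prune(node_w, edge_w, k, lam=1.0):
    """Greedily pick k samples under an ASSUMED objective:

        F(S) = sum_{i in S} node_w[i] + lam * sum_{i<j in S} edge_w[i, j]

    node_w : (n,) intrinsic value of each sample.
    edge_w : (n, n) symmetric pairwise (extrinsic) value, zero diagonal.
    Returns the selected indices in the order they were chosen.
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    n = node_w.shape[0]
    if edge_w.shape != (n, n):
        raise ValueError("edge_w must have shape (n, n)")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")

    # gain[v] is the marginal gain of adding v to the current selection:
    # node_w[v] + lam * sum_{u in S} edge_w[u, v]. With S empty, it is node_w.
    gain = node_w.copy()
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(k):
        # Mask out already-selected samples, then take the best remaining one.
        masked = np.where(available, gain, -np.inf)
        best = int(np.argmax(masked))
        selected.append(best)
        available[best] = False
        # Adding `best` raises every other sample's marginal gain by its
        # edge weight to `best`.
        gain += lam * edge_w[best]

    return selected


# Toy example: samples 0 and 1 are near-duplicates (edge weight 0),
# while sample 2 is diverse with respect to both (edge weight 1).
node_w = [1.0, 0.9, 0.5]
edge_w = [[0.0, 0.0, 1.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 0.0]]
print(greedy_prune(node_w, edge_w, k=2))  # [0, 2]
```

In the toy example, a purely intrinsic criterion would keep samples 0 and 1, whereas the combined gain prefers the lower-scoring but more diverse sample 2 as the second pick. Updating the gains incrementally keeps each step at O(n), so selecting k samples costs O(nk) once the edge weights are available.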

