Recent success in deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, in standard vision datasets, simple scores averaged over several weight initializations can be used to identify important examples very early in training. We propose two such scores -- the Gradient Normed (GraNd) and the Error L2-Norm (EL2N) scores -- and demonstrate their efficacy on a range of architectures and datasets by pruning significant fractions of training data without sacrificing test accuracy. In fact, using EL2N scores calculated a few epochs into training, we can prune half of the CIFAR10 training set while slightly improving test accuracy. Furthermore, for a given dataset, EL2N scores from one architecture or hyperparameter configuration generalize to other configurations. Compared to recent work that prunes data by discarding examples that are rarely forgotten over the course of training, our scores use only local information early in training. We also use our scores to detect noisy examples and study training dynamics through the lens of important examples -- we investigate how the data distribution shapes the loss surface and identify subspaces of the model's data representation that are relatively stable over training.
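To make the EL2N score concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not define the score, so the sketch assumes that EL2N is the L2 norm of the error vector (softmax output minus one-hot label), computed from logits taken a few epochs into training and averaged over several weight initializations. It further assumes that pruning keeps the highest-scoring examples. The names `el2n_scores` and `keep_highest` are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Subtract the per-row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def el2n_scores(logits_per_init, labels):
    """Assumed EL2N definition: ||softmax(logits) - one_hot(label)||_2,
    averaged over weight initializations.

    logits_per_init: array of shape (n_inits, n_examples, n_classes)
    labels: integer array of shape (n_examples,)
    Returns an array of shape (n_examples,).
    """
    logits_per_init = np.asarray(logits_per_init, dtype=float)
    n_classes = logits_per_init.shape[-1]
    one_hot = np.eye(n_classes)[np.asarray(labels)]
    # one_hot broadcasts across the leading initialization axis.
    errors = softmax(logits_per_init) - one_hot
    per_init = np.linalg.norm(errors, axis=-1)
    return per_init.mean(axis=0)


def keep_highest(scores, keep_fraction):
    """Indices of the highest-scoring examples (assumed pruning rule)."""
    n_keep = int(round(keep_fraction * len(scores)))
    order = np.argsort(scores)[::-1]
    return np.sort(order[:n_keep])
```

In this sketch, `keep_highest(el2n_scores(logits, labels), 0.5)` would return the indices of the half of the training set to retain. GraNd could be sketched in the same way by replacing the error norm with the norm of the per-example loss gradient, which requires an automatic differentiation framework and is omitted here.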