Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real noisy datasets and one synthetic noisy dataset show that Prune4Rel outperforms the baselines with re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.
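The subset-selection idea can be illustrated with a minimal sketch. The abstract does not give the exact objective, so everything specific here is an assumption made for illustration: neighborhoods are defined by thresholded cosine similarity between feature embeddings, an example's neighborhood confidence is the similarity-weighted sum of the confidences of selected examples, a `tanh` saturation gives diminishing returns, and the subset is built greedily. The function name `greedy_prune` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def greedy_prune(embeddings, conf, budget, threshold=0.5):
    """Greedily pick `budget` examples that maximize a saturated total
    neighborhood confidence over ALL training examples.

    Illustrative assumptions (not from the paper): cosine-similarity
    neighborhoods cut at `threshold`, and tanh as the saturating function.
    """
    # Cosine similarity between every pair of examples.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    # Only sufficiently similar examples count as neighbors.
    sim = np.where(sim >= threshold, sim, 0.0)

    n = sim.shape[0]
    current = np.zeros(n)              # neighborhood confidence per example
    remaining = np.ones(n, dtype=bool)
    selected = []

    for _ in range(min(budget, n)):
        # Column j holds each example's neighborhood confidence if j were added.
        candidate = current[:, None] + sim * conf[None, :]
        gains = np.tanh(candidate).sum(axis=0) - np.tanh(current).sum()
        gains[~remaining] = -np.inf    # never re-select an example
        best = int(np.argmax(gains))
        selected.append(best)
        remaining[best] = False
        current += sim[:, best] * conf[best]

    return selected
```

Calling `greedy_prune(embeddings, conf, budget=k)` with an `(n, d)` array of embeddings and a length-`n` array of prediction confidences returns `k` selected indices. The saturation matters in this sketch: without it the objective is a plain sum and the selection reduces to ranking examples by a fixed score, whereas with it the selection spreads across neighborhoods that are not yet well covered.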