With the increasing size of datasets used for training neural networks, data pruning has become an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper, we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvements across datasets, pruning methods, and pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while, surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
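The training setup described above, random pruning followed by training on both ground-truth labels and teacher soft predictions, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact loss, so this assumes the commonly used formulation: a weighted sum of cross-entropy on the labels and a temperature-scaled KL divergence to the teacher. The names `random_prune`, `kd_loss`, `keep_fraction`, `alpha`, and `temperature` are hypothetical and not taken from the paper.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def random_prune(num_samples, keep_fraction, seed=0):
    """Return sorted indices of a uniformly random subset to keep.

    `keep_fraction` is the fraction of samples retained (an assumed
    convention; the paper may define the pruning fraction differently).
    """
    rng = np.random.default_rng(seed)
    num_keep = int(round(num_samples * keep_fraction))
    return np.sort(rng.choice(num_samples, size=num_keep, replace=False))


def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """Assumed KD objective: (1 - alpha) * CE(labels) + alpha * T^2 * KL(teacher || student)."""
    # Cross-entropy against the ground-truth labels (no temperature).
    log_p_student = log_softmax(student_logits)
    ce = -log_p_student[np.arange(len(labels)), labels].mean()

    # KL divergence between temperature-softened teacher and student predictions.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student_t = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student_t)).sum(axis=-1).mean()

    return (1.0 - alpha) * ce + alpha * temperature**2 * kl


# Usage: keep a random half of a toy dataset, then score one mini-batch.
kept = random_prune(num_samples=1000, keep_fraction=0.5)
student = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
teacher = np.array([[3.0, 0.0, -2.0], [0.0, 0.1, 1.0]])
targets = np.array([0, 2])
loss = kd_loss(student, teacher, targets, alpha=0.7, temperature=2.0)
```

In this sketch, `alpha` plays the role of the knowledge distillation weight. The abstract reports a connection between the pruning factor and the optimal weight but does not state its form here, so the value 0.7 above is purely illustrative.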