With the increasing size of datasets used to train neural networks, data pruning has become an attractive field of research. However, most current data pruning algorithms are limited in their ability to preserve the accuracy of models trained on the full data, especially in high pruning regimes. In this paper, we explore data pruning combined with knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions of a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvements across datasets, pruning methods, and all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. We then make a compelling and highly practical empirical observation: with KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning fraction and the optimal knowledge distillation weight, which helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: at lower pruning fractions, larger teachers lead to accuracy degradation, while, surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
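The training objective described above, blending ground-truth labels with a teacher's soft predictions, can be sketched with the standard KD loss. This is a minimal NumPy illustration, not the paper's implementation: the distillation weight `alpha` (whose optimal value the abstract ties to the pruning fraction) and the temperature `T` are illustrative hyperparameters.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    """Knowledge-distillation objective:
    (1 - alpha) * CE(student, hard labels)
      + alpha * T^2 * KL(teacher_soft || student_soft).
    The T^2 factor keeps the soft-term gradients on a comparable
    scale to the hard-label term (Hinton et al.'s convention).
    """
    n = student_logits.shape[0]
    # Hard-label cross-entropy on the pruned subset's ground truth.
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    # Soft term: KL divergence from teacher to student at temperature T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```

With `alpha=0` this reduces to ordinary supervised training on the pruned subset; raising `alpha` shifts weight toward the teacher's soft predictions, which is what dampens the influence of noisy labels retained by the pruning algorithm.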