Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted considerable interest because data pruning can improve the so-called neural scaling laws; in [Sorscher et al.], the authors showed that high-quality data pruning algorithms are needed to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show, theoretically and empirically, why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch'' theorems for data pruning and present calibration protocols that use randomization to enhance the performance of existing pruning algorithms in this regime.
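To make the distinction between score-based and randomized pruning concrete, the following is a minimal sketch, not the calibration protocol of this work. It assumes per-example scores are already available and that a higher score means "keep". The function names, the `random_share` parameter, and the rule of filling part of the budget by score and the rest uniformly at random are all hypothetical choices made for illustration.

```python
import random


def score_based_prune(scores, keep_fraction):
    """Deterministic score-based pruning: keep the highest-scoring examples."""
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def random_prune(n, keep_fraction, rng):
    """Random pruning baseline: keep a uniform random subset."""
    n_keep = max(1, round(keep_fraction * n))
    return sorted(rng.sample(range(n), n_keep))


def mixed_prune(scores, keep_fraction, random_share, rng):
    """Hypothetical randomized variant (an assumption, not the paper's protocol).

    A share `random_share` of the kept budget is drawn uniformly at random
    from the examples that the score-based rule did not select; the rest of
    the budget is filled with the highest-scoring examples.
    """
    n = len(scores)
    n_keep = max(1, round(keep_fraction * n))
    n_random = round(random_share * n_keep)
    n_scored = n_keep - n_random
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:n_scored]
    chosen += rng.sample(ranked[n_scored:], n_random)
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.random() for _ in range(1000)]  # placeholder scores
    kept = mixed_prune(scores, keep_fraction=0.3, random_share=0.5, rng=rng)
    print(len(kept))
```

Setting `random_share=0.0` recovers plain score-based pruning, and `random_share=1.0` gives a uniform random subset, so the single parameter interpolates between the two regimes discussed above.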