Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted considerable interest because of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed that high-quality data pruning algorithms are needed to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show, theoretically and empirically, why such algorithms fail in the high compression regime. We establish ``No Free Lunch'' theorems for data pruning and present calibration protocols that use randomization to enhance the performance of existing pruning algorithms in this regime.
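To make the terminology concrete, the sketch below contrasts random pruning, score-based pruning, and a simple randomized mixture of the two. It is only an illustration under stated assumptions: the function names, the "keep the highest scores" rule, and the `random_share` parameter are hypothetical and are not taken from the paper, whose actual calibration protocols are not specified in this abstract.

```python
import random


def keep_count(n, keep_frac):
    """Number of examples retained when keeping a fraction keep_frac of n."""
    return max(1, round(keep_frac * n))


def random_prune(n, keep_frac, rng):
    """Random pruning: keep a uniformly random subset of the indices."""
    return sorted(rng.sample(range(n), keep_count(n, keep_frac)))


def score_prune(scores, keep_frac):
    """Score-based pruning (assumed rule): keep the highest-scoring examples."""
    k = keep_count(len(scores), keep_frac)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def mixed_prune(scores, keep_frac, random_share, rng):
    """Hypothetical randomized mixture, not the paper's protocol.

    A share random_share of the budget is drawn uniformly from the examples
    that the score rule would discard; the rest is filled by top scores.
    """
    n = len(scores)
    k = keep_count(n, keep_frac)
    k_rand = round(random_share * k)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = order[:k - k_rand]           # kept because of their score
    rest = order[k - k_rand:]          # candidates for the random part
    extra = rng.sample(rest, k_rand)   # kept by chance
    return sorted(top + extra)


rng = random.Random(0)
scores = [rng.random() for _ in range(1000)]  # placeholder scores
kept = mixed_prune(scores, keep_frac=0.3, random_share=0.5, rng=rng)
print(len(kept))  # 300
```

With `random_share=0.0` the mixture reduces to pure score-based pruning, and with `random_share=1.0` it samples only from examples outside the empty top set, which amounts to random pruning; intermediate values interpolate between the two.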