Large language models have demonstrated strong performance in recent years, but the high cost of training them motivates efficient methods for reducing dataset size. We propose TED pruning, a method that addresses overfitting at high pruning ratios by quantifying Internal Generalization (IG): the model's ability to improve performance on pruned data while fitting only the retained data. TED uses an optimization objective based on Internal Generalization Distance (IGD), which measures the change in IG before and after pruning, so that the objective aligns with true generalization performance and acts as an implicit regularizer. We verify that optimizing the IGD objective lets the model attain the smallest upper bound on generalization error. Using masks and a Taylor approximation, we study how small mask fluctuations affect IG, which enables fast estimation of IGD. By analyzing continuous training dynamics, we validate the prior effect of IGD and propose a progressive pruning strategy. Experiments on image classification, natural language understanding, and large language model fine-tuning show that TED achieves lossless performance with 60-70% of the data. Upon acceptance, our code will be made publicly available.
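The abstract does not define IG or IGD precisely, so the following is not TED's estimator. It is a minimal sketch of the general idea behind a Taylor-based fast estimate, under assumptions chosen only for illustration: a linear model with squared-error loss, a binary mask that marks retained samples, and a single gradient step on the retained data. The first-order Taylor expansion predicts the resulting change in loss on the pruned data as the negative learning rate times the inner product of the pruned-data and retained-data gradients. All function names (`mse_loss_and_grad`, `split_by_mask`, `taylor_change`, `exact_change`) and the synthetic data are hypothetical.

```python
import random

def mse_loss_and_grad(w, samples):
    """Mean squared-error loss and gradient of a linear model y ~ w . x."""
    n = len(samples)
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in samples:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += 0.5 * residual ** 2 / n
        for j, xj in enumerate(x):
            grad[j] += residual * xj / n
    return loss, grad

def split_by_mask(data, mask):
    """Split samples into retained (mask == 1) and pruned (mask == 0)."""
    kept = [s for s, m in zip(data, mask) if m]
    pruned = [s for s, m in zip(data, mask) if not m]
    return kept, pruned

def taylor_change(w, data, mask, lr):
    """First-order estimate of the pruned-data loss change after one
    gradient step on retained data: -lr * <g_pruned, g_kept>."""
    kept, pruned = split_by_mask(data, mask)
    _, g_kept = mse_loss_and_grad(w, kept)
    _, g_pruned = mse_loss_and_grad(w, pruned)
    return -lr * sum(a * b for a, b in zip(g_pruned, g_kept))

def exact_change(w, data, mask, lr):
    """Exact pruned-data loss change after the same gradient step."""
    kept, pruned = split_by_mask(data, mask)
    _, g_kept = mse_loss_and_grad(w, kept)
    before, _ = mse_loss_and_grad(w, pruned)
    w_new = [wi - lr * gi for wi, gi in zip(w, g_kept)]
    after, _ = mse_loss_and_grad(w_new, pruned)
    return after - before

# Synthetic regression data (an assumption for illustration only).
random.seed(0)
true_w = [1.0, -2.0]
data = []
for _ in range(100):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = sum(wi * xi for wi, xi in zip(true_w, x)) + random.gauss(0, 0.1)
    data.append((x, y))

mask = [1] * 70 + [0] * 30   # retain 70% of the samples
w0 = [0.0, 0.0]
estimate = taylor_change(w0, data, mask, lr=1e-3)
exact = exact_change(w0, data, mask, lr=1e-3)
print(f"first-order estimate: {estimate:.6f}, exact change: {exact:.6f}")
```

The estimate needs only two gradient evaluations and no retraining, which is why a Taylor expansion can make this kind of quantity cheap to approximate. A negative value means that fitting the retained samples also lowers the loss on the pruned ones in this toy setting.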