Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning, improving data efficiency in supervised fine-tuning (SFT) for specialized domains remains challenging due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide data selection for larger models. Through extensive experiments, we demonstrate that S2L significantly improves data efficiency in SFT for mathematical problem-solving: it matches full-dataset performance with just 11% of the original MathInstruct dataset (Yue et al., 2023), while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves 32.7% accuracy on the most challenging MATH benchmark (Hendrycks et al., 2021), a 16.6% improvement over Phi-2 (Li et al., 2023b). In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset while using only 50% of the data. Notably, S2L can perform data selection with a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
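The core idea of using a small model's training trajectories to pick data for a larger model can be sketched as follows. This is a minimal, hedged illustration, assuming the selection works by recording each example's loss across the small model's training checkpoints, clustering those loss trajectories, and sampling a balanced subset across clusters; the function names, the use of plain k-means, and the balanced-sampling details are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: cluster per-example loss trajectories recorded from a
# small reference model, then sample a balanced subset across clusters.
# All names and hyperparameters here are hypothetical.
import numpy as np

def select_subset(loss_trajectories, budget, n_clusters=4, seed=0):
    """loss_trajectories: (n_examples, n_checkpoints) array of each example's
    loss at successive checkpoints of the small reference model's training."""
    rng = np.random.default_rng(seed)
    X = np.asarray(loss_trajectories, dtype=float)
    n = len(X)

    # Plain k-means over trajectories (a stand-in for any clustering routine).
    centers = X[rng.choice(n, n_clusters, replace=False)]
    for _ in range(20):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)

    # Balanced sampling: draw an equal share from each cluster so that
    # examples with rare training dynamics are not crowded out.
    per_cluster = budget // n_clusters
    chosen = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        take = min(per_cluster, len(idx))
        chosen.extend(rng.choice(idx, take, replace=False))

    # Top up from the remaining pool if some clusters were too small.
    if len(chosen) < budget:
        pool = np.setdiff1d(np.arange(n), chosen)
        chosen.extend(rng.choice(pool, budget - len(chosen), replace=False))
    return sorted(int(i) for i in chosen)
```

The selected indices would then be used to fine-tune the large target model; since only the small reference model is ever trained to produce the trajectories, the selection cost scales with the small model's size.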