Despite the effectiveness of data selection for large language models (LLMs) during the pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains remains challenging due to the complexity of fine-tuning data. To bridge this gap, we introduce SmallToLarge (S2L), an effective and scalable data selection method for SFT that leverages training trajectories from small models to guide data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving: it matches full-dataset performance using just 11% of the original MathInstruct dataset (Yue et al., 2023), while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves 32.7% accuracy on the most challenging MATH benchmark (Hendrycks et al., 2021), improving over Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset while using only 50% of the data. Notably, S2L can perform data selection with a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.
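The abstract only names the key ingredient, training trajectories from a small model, without spelling out how they drive selection. The following is a minimal illustrative sketch, not the paper's exact algorithm: it assumes each training example has a per-checkpoint loss trajectory recorded from a small proxy model, clusters those trajectories with a tiny hand-rolled k-means, and fills the selection budget evenly across clusters so that examples with distinct learning dynamics are all represented. The function name and all parameters are hypothetical.

```python
import numpy as np

def select_by_trajectories(trajectories, k=4, budget=8, n_iter=20, seed=0):
    """Illustrative sketch: cluster per-example loss trajectories
    (recorded from a small proxy model across training checkpoints),
    then sample the selection budget evenly across clusters."""
    rng = np.random.default_rng(seed)
    X = np.asarray(trajectories, dtype=float)  # shape: (n_examples, n_checkpoints)

    # Tiny k-means on the trajectories (Lloyd's algorithm).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)

    # Balanced sampling: take roughly budget/k examples per cluster.
    per_cluster = max(1, budget // k)
    selected = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        if take > 0:
            selected.extend(rng.choice(idx, size=take, replace=False).tolist())
    return sorted(selected)[:budget]
```

Because the clustering runs on loss trajectories rather than model weights or gradients, the proxy model can be far smaller than the target model, which is what makes the 40x-smaller reference model in the abstract plausible as a selection mechanism.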