Large language models typically undergo two training stages: pretraining and finetuning. Although large-scale pretraining endows a model with strong capabilities for generating natural language responses, pretrained models can still fail to understand human instructions. Instruction finetuning has emerged as a critical method for improving language models' ability to interpret and respond to instructions. Recent studies have found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, there are still no clear guidelines for selecting high-quality datasets for finetuning. In this paper, we propose InstructMining, a linear rule for evaluating the quality of instruction-following data. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we conduct extensive finetuning experiments and use the results to estimate the parameters of InstructMining. To further evaluate its performance, we use InstructMining to select high-quality data from unseen datasets. The results demonstrate that InstructMining can help select relatively high-quality samples from various instruction-following datasets. Compared with models finetuned on unfiltered datasets, models finetuned on InstructMining-selected datasets perform better in 42.5% of cases.
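The idea of a linear rule whose parameters are estimated from finetuning experiments can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each finetuning experiment yields one row of indicator values and one scalar quality score, and that the parameters are fit by ordinary least squares; the abstract specifies neither the indicators nor the estimation procedure. The function names (`fit_linear_rule`, `score`, `select_top`) and all numbers are hypothetical and synthetic, chosen only for illustration.

```python
import numpy as np


def fit_linear_rule(indicators, quality):
    """Estimate intercept and weights of a linear quality rule by least squares.

    indicators: (n_experiments, n_indicators) array of indicator values (assumed).
    quality:    (n_experiments,) array of observed quality scores (assumed).
    """
    X = np.asarray(indicators, dtype=float)
    y = np.asarray(quality, dtype=float)
    # Prepend a column of ones so the first parameter acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params  # params[0] is the intercept, params[1:] are the weights


def score(params, indicators):
    """Apply the fitted linear rule to new indicator rows."""
    X = np.asarray(indicators, dtype=float)
    return params[0] + X @ params[1:]


def select_top(params, indicators, k):
    """Return indices of the k rows with the highest predicted quality."""
    predicted = score(params, indicators)
    order = np.argsort(-predicted, kind="stable")
    return order[:k].tolist()


# Synthetic example: quality = 1 + 2 * indicator_0 - 1 * indicator_1 (made up).
train_indicators = [[0, 0], [1, 0], [0, 1], [1, 1]]
train_quality = [1.0, 3.0, 0.0, 2.0]
params = fit_linear_rule(train_indicators, train_quality)

# Rank three unseen candidates and keep the best two.
candidates = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
selected = select_top(params, candidates, k=2)
```

In this sketch a higher score means higher quality. If the fitted target were instead something to minimize, such as an evaluation loss, the selection step would keep the lowest-scoring rows.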