There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what makes data high-quality? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such examples, via manual curation and via GPT-3.5-Turbo as a quality scorer, respectively. We show that the extremely simple baseline of selecting the 1,000 instructions with the longest responses from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the OpenLLM benchmarks that test factual knowledge. We demonstrate this for several state-of-the-art LLMs (Llama-2-7B, Llama-2-13B, and Mistral-7B) and datasets (Alpaca-52k and Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain the second highest-ranked Llama-2-7B-based model on AlpacaEval 2.0 while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to verify that their improved performance is not simply due to GPT-4's preference for longer responses, which rules out an artificial improvement. In conclusion, our findings suggest that fine-tuning on the instructions with the longest responses should be the default baseline for any research on instruction fine-tuning.
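The selection baseline described in the abstract (keep the 1,000 examples whose responses are longest) can be sketched in a few lines. This is a minimal illustration and not the authors' code: the `instruction`/`output` field names follow the Alpaca JSON layout, the helper names `response_length` and `select_longest` are hypothetical, and whitespace-separated word count is assumed as a stand-in for whatever length measure the paper actually uses.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def response_length(example: Example) -> int:
    # Assumption: word count approximates response length;
    # a tokenizer-based count could be swapped in here.
    return len(example["output"].split())


def select_longest(
    examples: List[Example],
    k: int = 1000,
    length_fn: Callable[[Example], int] = response_length,
) -> List[Example]:
    # Sort by response length, longest first. Python's sort is stable,
    # so examples of equal length keep their original dataset order.
    return sorted(examples, key=length_fn, reverse=True)[:k]


# Toy data standing in for a real instruction dataset.
toy = [
    {"instruction": "a", "output": "one"},
    {"instruction": "b", "output": "one two three"},
    {"instruction": "c", "output": "one two"},
]

subset = select_longest(toy, k=2)
print([ex["instruction"] for ex in subset])  # ['b', 'c']
```

Passing a different `length_fn` makes it easy to compare length definitions (characters, words, tokens) without changing the selection logic.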