Large language models~(LLMs) strengthen their instruction-following capability through instruction finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) contain a surprising number of low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality examples filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca, as evaluated by GPT-4 on multiple test sets and in a controlled human evaluation. Its 13B variant achieves more than $90\%$ of the performance of its teacher LLM (i.e., Text-Davinci-003, which generated the 52k data) on test tasks. It also trains 5.7x faster, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, our experiments demonstrate the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at \url{https://lichang-chen.github.io/AlpaGasus/}.
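As a rough illustration of the selection strategy described above, the following minimal sketch keeps only the instruction/response pairs whose quality score reaches a threshold. It is not the authors' implementation: the names `filter_by_score` and `toy_score`, the 0 to 5 score scale, and the threshold value are all assumptions made for this example, and `toy_score` is a trivial stand-in for a call to a strong LLM rater such as ChatGPT.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def filter_by_score(
    data: List[Example],
    score_fn: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only the examples whose score is at least `threshold`."""
    return [ex for ex in data if score_fn(ex) >= threshold]


def toy_score(ex: Example) -> float:
    # Hypothetical stand-in for an LLM-based rater: an empty response
    # gets the lowest score, anything else the highest. A real filter
    # would instead prompt a strong LLM to grade the response.
    return 0.0 if not ex["response"].strip() else 5.0


toy_data = [
    {"instruction": "Name a primary color.", "response": "Red."},
    {"instruction": "Translate 'hello' to French.", "response": ""},
    {"instruction": "What is 2 + 2?", "response": "4"},
]

# The threshold is an assumed value chosen only for this example.
kept = filter_by_score(toy_data, toy_score, threshold=4.5)

# Training-time figures reported above: 80 minutes vs. 14 minutes.
speedup = 80 / 14
```

Here `kept` holds the two examples with non-empty responses, and `speedup` is about 5.7, consistent with the reported reduction from 80 to 14 minutes. Swapping `toy_score` for a function that queries an LLM rater leaves the filtering logic unchanged.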