Large language models~(LLMs) obtain instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k examples) contain a surprising number of low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and removes low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality examples filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets, and its 13B variant achieves $>90\%$ of the performance of its teacher LLM (i.e., Text-Davinci-003) on test tasks. It also trains $5.7\times$ faster, reducing the training time for the 7B variant from 80 minutes (for Alpaca) to 14 minutes\footnote{We apply IFT for the same number of epochs as Alpaca (7B) but on less data, using 4$\times$NVIDIA A100 (80GB) GPUs and following the original Alpaca settings and hyperparameters.}. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: \url{https://lichang-chen.github.io/AlpaGasus/}.
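The selection step described above can be illustrated with a minimal sketch. It assumes each instruction/response pair receives a numeric quality score from a grader and is kept only if the score meets a cutoff. The names `Example`, `select_high_quality`, and `toy_grade`, the score scale, and the `4.5` cutoff are all hypothetical and not taken from the text. In practice the grader would be a call to a strong LLM; here a trivial offline stand-in is used so that the example runs by itself.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """One supervised instruction/response pair (hypothetical container)."""
    instruction: str
    response: str


def select_high_quality(
    examples: Iterable[Example],
    grade: Callable[[Example], float],
    threshold: float,
) -> List[Example]:
    """Keep only the examples whose grader score is at least `threshold`."""
    return [ex for ex in examples if grade(ex) >= threshold]


def toy_grade(ex: Example) -> float:
    """Offline stand-in for an LLM grader (assumption, not the real rater).

    It gives a high score to any non-empty response and a low score otherwise.
    """
    return 5.0 if ex.response.strip() else 1.0


DATA = [
    Example("Name the capital of France.", "Paris."),
    Example("Summarize the following article.", ""),  # empty, low-quality response
]

# The 4.5 cutoff is an illustrative assumption.
KEPT = select_high_quality(DATA, toy_grade, threshold=4.5)

# Training-time ratio from the reported 80 and 14 minutes.
SPEEDUP = round(80 / 14, 1)

if __name__ == "__main__":
    print(f"kept {len(KEPT)} of {len(DATA)} examples")
    print(f"speedup: {SPEEDUP}x")
```

Passing the grader in as a callable keeps the filtering logic independent of any particular LLM API, so the stand-in can be replaced by a real rating call without changing `select_high_quality`.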