With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resources required to train and evaluate models, an efficient method for selecting high-quality IT data is valuable. However, existing methods for instruction data selection have limitations: they rely on fragile external APIs, are affected by biases in GPT models, or reduce the diversity of the selected instruction dataset. In this paper, we propose an industry-friendly, expert-aligned, and diversity-preserving instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step ranks instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25%). The second step preserves dataset diversity through a clustering process. In our experiments, CaR selected a subset containing only 1.96% of Alpaca's IT data, yet the AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. Furthermore, our method uses small models (355M parameters) and requires only 11.2% of the monetary cost of existing methods, making it easy to deploy in industrial scenarios.
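The two steps described above can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the embedding model, or how the ranked and clustered selections are combined, so everything here is an assumption made for illustration: the function names `kmeans` and `car_select`, the parameters `k`, `n_top`, and `n_per_cluster`, the use of plain k-means, and the choice to take the union of the globally top-scored items and the top-scored items in each cluster. Quality scores and embeddings are taken as precomputed inputs.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering choice); returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def car_select(scores, embeddings, k, n_top, n_per_cluster):
    """Hypothetical selection rule: rank by score, then add per-cluster picks.

    scores     -- one quality score per instruction pair (higher is better)
    embeddings -- one vector (list of floats) per instruction pair
    Returns the sorted indices of the selected instruction pairs.
    """
    # Step 1 (ranking): order all items by score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:n_top])

    # Step 2 (clustering): keep the best-scored items of every cluster so that
    # regions with low overall scores are still represented.
    labels = kmeans(embeddings, k)
    for c in range(k):
        members = [i for i in order if labels[i] == c]
        selected.update(members[:n_per_cluster])
    return sorted(selected)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]
    embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(car_select(scores, embeddings, k=2, n_top=2, n_per_cluster=1))
```

In this toy input, ranking alone would pick only items from the first group of points; the per-cluster step adds the best-scored item from the second group, which is the diversity-preserving effect the abstract attributes to clustering.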