Instruction tuning benefits from large and diverse datasets, but creating such datasets incurs a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly addressed this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Furthermore, traditional data selection methods that rely on input embedding space density tend to underestimate instruction sample complexity, whereas those based on model prediction uncertainty often struggle with synthetic label quality. We therefore introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. SelectLLM consists of two key steps: coreset-based clustering of the unlabeled instructions for diversity, followed by prompting an LLM to identify the most beneficial instructions within each cluster. Our experiments demonstrate that SelectLLM matches or outperforms other state-of-the-art methods on instruction tuning benchmarks. It exhibits remarkable consistency across human and synthetic datasets, along with better cross-dataset generalization, as evidenced by a 10% performance improvement on the Cleaned Alpaca test set when trained on Dolly data. All code and data are publicly available (https://github.com/minnesotanlp/select-llm).
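The two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain k-means stands in for the coreset-based clustering, the embeddings are toy 2-D vectors, and `llm_pick` is a hypothetical placeholder for the actual LLM prompting step (here it simply takes the longest instruction in a cluster).

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: a simplified stand-in for coreset-based clustering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: dist2(p, centers[c]))
        # Recompute each non-empty cluster's center as the member mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

def llm_pick(cluster):
    """Hypothetical stand-in for prompting an LLM to judge which instruction
    in a cluster is most beneficial; here we crudely take the longest one."""
    return max(cluster, key=len)

def select_llm(instructions, embeddings, k):
    """Cluster instructions for diversity, then select one per cluster."""
    assign = kmeans(embeddings, k)
    selected = []
    for c in range(k):
        cluster = [ins for ins, a in zip(instructions, assign) if a == c]
        if cluster:
            selected.append(llm_pick(cluster))
    return selected
```

In the real framework, the per-cluster selection is performed by prompting an LLM with the candidate instructions, rather than by a heuristic like length.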