Instruction tuning benefits from large and diverse datasets; however, creating such datasets incurs a high cost in human labeling. While synthetic datasets generated by large language models (LLMs) have partly addressed this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Furthermore, traditional data selection methods that rely on input embedding space density tend to underestimate instruction sample complexity, whereas those based on model prediction uncertainty often struggle with synthetic label quality. We therefore introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions for diversity, followed by prompting an LLM to identify the most beneficial instructions within each cluster. Our experiments demonstrate that SelectLLM matches or outperforms other state-of-the-art methods on instruction tuning benchmarks. It exhibits remarkable consistency across human and synthetic datasets, along with better cross-dataset generalization, as evidenced by a 10% performance improvement on the Cleaned Alpaca test set when trained on Dolly data. All code and data are publicly available (https://github.com/minnesotanlp/select-llm).
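The two-step procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: greedy k-center (farthest-point) selection stands in for the coreset-based clustering step, the prompt wording is invented for illustration, and `ask_llm` is a hypothetical callable that takes a prompt string and returns a list of 1-based indices chosen by the model. Embeddings are assumed to be precomputed numeric vectors.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def k_center_greedy(points: Sequence[Vector], k: int) -> List[int]:
    """Pick k center indices by greedy farthest-point (k-center) selection."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [0]  # assumption: start deterministically from the first point
    # Distance from every point to its nearest chosen center so far.
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


def assign_clusters(points: Sequence[Vector], centers: Sequence[int]) -> List[List[int]]:
    """Group point indices by their nearest center."""
    clusters: List[List[int]] = [[] for _ in centers]
    for i, p in enumerate(points):
        nearest = min(range(len(centers)),
                      key=lambda c: math.dist(p, points[centers[c]]))
        clusters[nearest].append(i)
    return clusters


def build_prompt(instructions: Sequence[str], n_select: int) -> str:
    """Hypothetical prompt asking an LLM to pick instructions from one cluster."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(instructions))
    return (
        f"Below are {len(instructions)} unlabeled instructions.\n{numbered}\n"
        f"Return the numbers of the {n_select} instructions that would be "
        "most beneficial to annotate for instruction tuning."
    )


def select_instructions(
    instructions: Sequence[str],
    embeddings: Sequence[Vector],
    k: int,
    per_cluster: int,
    ask_llm: Callable[[str], List[int]],  # hypothetical LLM wrapper
) -> List[int]:
    """Step 1: cluster for diversity. Step 2: let the LLM pick within each cluster."""
    centers = k_center_greedy(embeddings, k)
    chosen: List[int] = []
    for members in assign_clusters(embeddings, centers):
        texts = [instructions[i] for i in members]
        picks = ask_llm(build_prompt(texts, per_cluster))
        # Ignore out-of-range answers; map 1-based picks back to dataset indices.
        chosen.extend(members[p - 1] for p in picks[:per_cluster]
                      if 1 <= p <= len(members))
    return chosen
```

In practice `ask_llm` would wrap an actual model call and parse its reply; for a dry run, a stub such as `lambda prompt: [1]` simply selects the first instruction in each cluster.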