SelectLLM: Can LLMs Select Important Instructions to Annotate?

Instruction tuning benefits from large and diverse datasets; however, creating such datasets involves a high cost of human labeling. While synthetic datasets generated by large language models (LLMs) have partly solved this issue, they often contain low-quality data. One effective solution is to selectively annotate unlabeled instructions, especially given the relative ease of acquiring unlabeled instructions or texts from various sources. However, how to select unlabeled instructions is not well explored, especially in the context of LLMs. Further, traditional data selection methods that rely on input embedding space density tend to underestimate instruction sample complexity, whereas those based on model prediction uncertainty often struggle with synthetic label quality. Therefore, we introduce SelectLLM, an alternative framework that leverages the capabilities of LLMs to select unlabeled instructions more effectively. SelectLLM consists of two key steps: coreset-based clustering of unlabeled instructions for diversity, followed by prompting an LLM to identify the most beneficial instructions within each cluster. Our experiments demonstrate that SelectLLM matches or outperforms other state-of-the-art methods on instruction tuning benchmarks. It exhibits remarkable consistency across human and synthetic datasets, along with better cross-dataset generalization, as evidenced by a 10% performance improvement on the Cleaned Alpaca test set when trained on Dolly data. All code and data are publicly available (https://github.com/minnesotanlp/select-llm).
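
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: it assumes k-center greedy (farthest-point traversal) as the coreset heuristic, toy embeddings passed in as plain lists of floats, and a caller-supplied `ask_llm` callable standing in for a real LLM API. The function names and the prompt wording are hypothetical and chosen only for illustration.

```python
import math
import re
from typing import Callable, Dict, List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k center indices by farthest-point traversal (assumed coreset heuristic)."""
    if not points or k <= 0:
        return []
    centers = [0]  # start from the first point so the sketch is deterministic
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist.__getitem__)
        if dist[nxt] == 0.0:  # every remaining point coincides with a center
            break
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers


def cluster_by_centers(
    points: Sequence[Sequence[float]], centers: List[int]
) -> Dict[int, List[int]]:
    """Assign every point index to its nearest center, forming one cluster per center."""
    clusters: Dict[int, List[int]] = {c: [] for c in centers}
    for i, p in enumerate(points):
        nearest = min(centers, key=lambda c: math.dist(p, points[c]))
        clusters[nearest].append(i)
    return clusters


def build_prompt(instructions: List[str]) -> str:
    """Hypothetical prompt asking the LLM to pick one instruction from a cluster."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(instructions))
    return (
        "Below are unlabeled instructions from one cluster.\n"
        f"{numbered}\n"
        "Reply with the number of the instruction most useful to annotate."
    )


def select_instructions(
    instructions: List[str],
    embeddings: Sequence[Sequence[float]],
    k: int,
    ask_llm: Callable[[str], str],
) -> List[str]:
    """Step 1: cluster for diversity. Step 2: let the LLM choose within each cluster."""
    centers = k_center_greedy(embeddings, k)
    clusters = cluster_by_centers(embeddings, centers)
    chosen = []
    for members in clusters.values():
        texts = [instructions[i] for i in members]
        match = re.search(r"\d+", ask_llm(build_prompt(texts)))
        pick = int(match.group()) - 1 if match else 0
        if not 0 <= pick < len(texts):
            pick = 0  # fall back to the first item on an unusable reply
        chosen.append(texts[pick])
    return chosen
```

In use, `ask_llm` would wrap a call to whichever model is available, and the selected instructions would then be sent for annotation. Keeping the LLM behind a plain callable makes the selection logic testable with a stub.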
