Recent advances in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can substantially equip LLMs with instruction-following capabilities, outperforming large datasets that are often burdened by quality and redundancy issues. The challenge, however, lies in automatically identifying valuable subsets of large datasets to improve both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability. We then propose the D3 method, which comprises two key steps: scoring and selection. Specifically, in the scoring step, we define a diversity function to measure sample distinctiveness and introduce uncertainty-based prediction difficulty to evaluate sample difficulty while mitigating interference from context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes the three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to adaptively refine the selection focus. Experiments on three datasets demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the full dataset.
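To make the scoring-and-selection idea concrete, below is a minimal sketch of a D3-style weighted coreset selection. This is a hypothetical illustration, not the paper's actual objective: the function `d3_select`, the specific scoring form (diversity as distance to the nearest already-selected sample, plus weighted difficulty and dependability terms), and the greedy optimization are all assumptions made for exposition.

```python
# Hypothetical sketch of weighted coreset selection over three data-value
# aspects (diversity, difficulty, dependability); not the paper's exact method.
import math

def d3_select(embeddings, difficulty, dependability, k, w=(1.0, 1.0, 1.0)):
    """Greedily pick k sample indices maximizing a weighted sum of
    diversity, difficulty, and dependability scores."""
    n = len(embeddings)
    selected = []
    # Distance of each sample to its nearest already-selected sample.
    min_dist = [math.inf] * n
    for _ in range(k):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Before any pick, treat every sample as maximally diverse.
            div = min_dist[i] if selected else 1.0
            score = w[0] * div + w[1] * difficulty[i] + w[2] * dependability[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        # Update nearest-selected distances after adding the new sample.
        for i in range(n):
            d = math.dist(embeddings[i], embeddings[best])
            min_dist[i] = min(min_dist[i], d)
    return selected
```

In this toy form, each greedy round trades off how far a candidate sits from the current subset against its difficulty and dependability scores; the weights `w` stand in for the joint optimization of the three aspects, and an outer loop could re-run selection with updated scores to mimic the multi-round feedback described above.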