Aligning large language models (LLMs) with humans is a critical step in effectively utilizing their pre-trained capabilities across a wide array of language tasks. Current instruction tuning practices often rely on expanding dataset size without a clear strategy for ensuring data quality, which can inadvertently introduce noise and degrade model performance. To address this challenge, we introduce Nuggets, a novel and efficient methodology that employs one-shot learning to select high-quality instruction data from large datasets. Nuggets assesses the potential of each instruction example to act as an effective one-shot example, thereby identifying those that can significantly enhance performance across diverse tasks. It scores each candidate example by its impact on the perplexity of a diverse anchor set, which allows the most beneficial data to be selected for instruction tuning. Through rigorous testing on two benchmarks, MT-Bench and Alpaca-Eval, we demonstrate that instruction tuning with the top 1% of Nuggets-curated examples substantially outperforms conventional methods that use the full dataset. These findings advocate for a data selection paradigm that prioritizes quality, offering a more efficient pathway to align LLMs with humans.
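The perplexity-based scoring described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the choices here are assumptions: a candidate's score is taken to be the fraction of anchor examples whose perplexity drops when the candidate is prepended as a one-shot demonstration. The names `loss_fn`, `one_shot_score`, and `select_top_fraction`, as well as the prompt layout, are hypothetical; `loss_fn` stands in for whatever routine returns the model's mean per-token negative log-likelihood of an answer given a context.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (instruction, response)

# Hypothetical interface: mean per-token negative log-likelihood of
# `answer` given `context`, as computed by the model being tuned.
LossFn = Callable[[str, str], float]


def perplexity(loss_fn: LossFn, context: str, answer: str) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(loss_fn(context, answer))


def one_shot_score(candidate: Example,
                   anchors: Sequence[Example],
                   loss_fn: LossFn) -> float:
    """Fraction of anchors whose perplexity drops when `candidate` is
    prepended as a one-shot demonstration (an assumed scoring rule)."""
    if not anchors:
        raise ValueError("anchor set must not be empty")
    # Assumed prompt layout: demonstration, blank line, then the anchor task.
    demo = f"{candidate[0]}\n{candidate[1]}\n\n"
    improved = 0
    for instruction, response in anchors:
        zero_shot = perplexity(loss_fn, instruction, response)
        one_shot = perplexity(loss_fn, demo + instruction, response)
        if one_shot < zero_shot:
            improved += 1
    return improved / len(anchors)


def select_top_fraction(candidates: Sequence[Example],
                        anchors: Sequence[Example],
                        loss_fn: LossFn,
                        fraction: float = 0.01) -> List[Example]:
    """Keep the highest-scoring `fraction` of candidates (at least one)."""
    ranked = sorted(candidates,
                    key=lambda c: one_shot_score(c, anchors, loss_fn),
                    reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

With `fraction=0.01`, `select_top_fraction` keeps the top 1% of candidates, mirroring the subset size mentioned in the abstract. In practice `loss_fn` would wrap a forward pass of the language model, and scoring every candidate against every anchor costs one forward pass per (candidate, anchor) pair.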