Instruction tuning has unlocked powerful capabilities in large language models (LLMs), using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm that estimates data influences and performs Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
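The two-stage pipeline described in the abstract (build a datastore of low-dimensional gradient features, then select the training examples most similar to a few target examples) can be illustrated with a minimal sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the "gradients" are synthetic random vectors rather than model gradients, the dimensionality reduction is a plain Gaussian random projection, the few-shot features are averaged before scoring, and the function names (`make_projection`, `project`, `select_top_fraction`) are hypothetical. The Adam-aware influence formulation and the handling of variable-length data are not reproduced here.

```python
import math
import random


def make_projection(dim_in, dim_out, seed=0):
    """Gaussian random projection matrix (dim_out x dim_in), scaled by 1/sqrt(dim_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(proj, vec):
    """Map a high-dimensional vector to a low-dimensional feature."""
    return [sum(w * x for w, x in zip(row, vec)) for row in proj]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_fraction(train_feats, target_feats, fraction=0.05):
    """Rank training features by similarity to the mean target feature.

    Averaging the few-shot features is an assumption of this sketch.
    Returns (indices of the top `fraction` of examples, all scores).
    """
    dim = len(target_feats[0])
    mean_target = [sum(t[j] for t in target_feats) / len(target_feats)
                   for j in range(dim)]
    scores = [cosine(f, mean_target) for f in train_feats]
    k = max(1, int(len(train_feats) * fraction))
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data: synthetic vectors stand in for per-example gradients.
DIM_IN, DIM_OUT = 50, 8
data_rng = random.Random(1)
skill = [data_rng.gauss(0.0, 1.0) for _ in range(DIM_IN)]
train_grads = [[data_rng.gauss(0.0, 1.0) for _ in range(DIM_IN)]
               for _ in range(40)]
# Plant two training examples that point in the target "skill" direction.
train_grads[3] = [2.0 * x for x in skill]
train_grads[17] = [0.5 * x for x in skill]
# Two few-shot target examples embodying the skill.
target_grads = [skill, [1.5 * x for x in skill]]

# Stage 1: build the low-dimensional datastore once; it can be reused per target.
proj = make_projection(DIM_IN, DIM_OUT)
train_feats = [project(proj, g) for g in train_grads]

# Stage 2: project the target examples and select the top 5% of training data.
target_feats = [project(proj, g) for g in target_grads]
selected, scores = select_top_fraction(train_feats, target_feats, fraction=0.05)
```

Because the projection is linear, the two planted examples keep their alignment with the target direction after projection and are the ones selected. In a real setting the features would be derived from model gradients, and the datastore would be computed once and queried again for each new target capability.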