A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
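
To make the greedy round-robin idea concrete, the following is a minimal sketch, not the implementation from the linked repository. It assumes each candidate and each query example has already been mapped to a fixed-length vector (for instance a gradient-based representation) and that similarity is measured by cosine similarity. The function names `cosine` and `greedy_round_robin`, the tie-breaking rule, and the toy vectors are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_round_robin(
    candidates: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
    budget: int,
) -> List[int]:
    """Return candidate indices in the order they were selected.

    Assumed procedure: queries take turns, and on its turn each query
    picks its most similar candidate that has not been selected yet.
    """
    budget = min(budget, len(candidates))
    if budget <= 0 or not queries:
        return []

    # rankings[j] lists candidate indices from most to least similar to query j.
    # sorted() is stable, so ties are broken by the lower candidate index.
    rankings = [
        sorted(range(len(candidates)), key=lambda i: -cosine(candidates[i], q))
        for q in queries
    ]
    cursors = [0] * len(queries)  # next position to inspect in each ranking
    selected: List[int] = []
    chosen = set()

    while len(selected) < budget:
        for j, ranking in enumerate(rankings):
            if len(selected) >= budget:
                break
            # Skip candidates already taken by an earlier turn.
            pos = cursors[j]
            while pos < len(ranking) and ranking[pos] in chosen:
                pos += 1
            cursors[j] = pos
            if pos < len(ranking):
                selected.append(ranking[pos])
                chosen.add(ranking[pos])
    return selected


# Toy example: two queries pointing along different axes.
toy_candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
picked = greedy_round_robin(toy_candidates, toy_queries, budget=3)
print(picked)
```

Because the queries alternate, the selected subset covers every query example before any query receives a second pick, which is one simple way to keep the subset close to the whole query set rather than to a single dominant query.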
