Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a consistent pattern: samples with the lowest differential entropy yield the best performance across domains, though the principle is domain-adaptive in direction, with reasoning tasks favoring entropy increase (cognitive expansion) and general tasks favoring entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves a 17\% relative improvement over full-data training on mathematical reasoning and 52\% on general instruction-following, outperforming prior baselines while using only 10\% of the data.
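The core selection signal described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `token_entropy`, `rank_by_differential_entropy`, and the toy logit arrays are hypothetical names, and real usage would obtain per-token logits from a base and a warmup-calibrated model via a forward pass; the bi-directional NLL filtering step is omitted.

```python
import numpy as np

def token_entropy(logits):
    """Mean per-token predictive entropy (nats) from a (seq_len, vocab) logit matrix."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def rank_by_differential_entropy(base_logits, calib_logits, domain="general"):
    """Rank sample indices by differential entropy H_calibrated - H_base.

    Per the abstract's domain-adaptive rule: 'reasoning' favors entropy
    increase (largest positive differential first), while 'general' favors
    entropy decrease (most negative differential first).
    """
    diffs = np.array([token_entropy(c) - token_entropy(b)
                      for b, c in zip(base_logits, calib_logits)])
    order = np.argsort(diffs)                 # ascending: largest decrease first
    return order[::-1] if domain == "reasoning" else order
```

A selection pipeline would then keep the top 10\% of samples under the chosen ranking, matching the data budget reported in the experiments.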