Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, so a greedy algorithm achieves a $(1-1/e)$ approximation under a cardinality budget. In practice, however, we identify that alleviating gradient conflicts (misalignment between per-sample gradients) is a key factor: reducing conflict slows the decay of marginal log-determinant information gains and thereby prevents significant loss of information. We formalize this via an $\varepsilon$-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than the original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data yet matches or exceeds 6 methods, including full-data tuning, at substantially lower training cost.
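The greedy log-determinant maximization mentioned above can be sketched as follows. This is a minimal illustration, not the paper's SPICE implementation: the function name, the ridge regularizer, and the use of raw per-sample gradient features as rows of `G` are all assumptions for the example. It exploits the matrix determinant lemma, which says adding a rank-one term $gg^\top$ to $A$ increases $\log\det A$ by $\log(1 + g^\top A^{-1} g)$, and maintains the running inverse with a Sherman-Morrison update.

```python
import numpy as np

def greedy_logdet_select(G, k, ridge=1e-3):
    """Greedily pick k rows of G maximizing log det(ridge*I + sum_i g_i g_i^T).

    By the matrix determinant lemma, adding sample g changes the
    log-determinant by log(1 + g^T A^{-1} g), so each candidate can be
    scored cheaply; A^{-1} is kept current via a Sherman-Morrison
    rank-one update. Illustrative sketch only (hypothetical names).
    """
    n, d = G.shape
    A_inv = np.eye(d) / ridge          # inverse of the current matrix A = ridge*I
    selected, remaining = [], list(range(n))
    for _ in range(k):
        # marginal log-det gain of each remaining candidate: log(1 + g^T A^{-1} g)
        gains = [np.log1p(G[i] @ A_inv @ G[i]) for i in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
        # Sherman-Morrison update of A^{-1} after adding g g^T to A
        g = G[best]
        Ag = A_inv @ g
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return selected
```

Because the objective is monotone submodular, this greedy loop inherits the $(1-1/e)$ guarantee; a conflict-aware variant would additionally penalize candidates whose gradients misalign with those already selected.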