Supervised fine-tuning (SFT) is a widely used technique for adapting large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This has motivated data curation for SFT, which prioritizes the most valuable data for optimization. This work studies the family of online batch selection methods, which dynamically score and filter samples during training. However, existing popular methods often (i) rely solely on data utility to select a subset while neglecting other crucial factors such as diversity, (ii) depend on external resources such as reference models or validation sets, and (iii) incur extra training time compared with full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, and estimates inter-sample diversity through efficient low-dimensional embedding comparisons against a lightweight memory buffer of historical samples. This design eliminates the need for external resources and unnecessary backpropagation, which keeps selection computationally efficient. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.
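As a rough illustration of the selection idea described above, the following Python sketch scores each candidate by the nuclear norm of its logits matrix and by its dissimilarity to a small buffer of previously selected embeddings. It is not the authors' implementation: the function names, the cosine-based novelty term, the min-max normalization, the mixing weight `alpha`, and the buffer update policy are all assumptions made for illustration, and the low-dimensional embeddings are simply taken as given inputs.

```python
from collections import deque

import numpy as np


def nuclear_norm_score(logits: np.ndarray) -> float:
    """Sum of singular values of a (tokens x vocab) logits matrix."""
    return float(np.linalg.norm(logits, ord="nuc"))


def novelty_score(embedding: np.ndarray, buffer: deque) -> float:
    """1 - max cosine similarity to buffered embeddings (1.0 if buffer is empty).

    Assumption: cosine similarity stands in for the embedding comparison.
    """
    if not buffer:
        return 1.0
    bank = np.stack(list(buffer))
    sims = bank @ embedding / (
        np.linalg.norm(bank, axis=1) * np.linalg.norm(embedding) + 1e-12
    )
    return float(1.0 - sims.max())


def _min_max(x: np.ndarray) -> np.ndarray:
    """Rescale to [0, 1]; all zeros if every entry is equal."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def select_batch(logits_list, embeddings, buffer, budget, alpha=0.5):
    """Return sorted indices of the `budget` highest-scoring samples.

    Assumption: the two signals are combined as a weighted sum with
    hypothetical weight `alpha`; the source does not specify this rule.
    """
    nuc = np.array([nuclear_norm_score(l) for l in logits_list])
    nov = np.array([novelty_score(e, buffer) for e in embeddings])
    score = alpha * _min_max(nuc) + (1.0 - alpha) * _min_max(nov)
    keep = np.argsort(-score, kind="stable")[:budget]
    for i in keep:
        buffer.append(embeddings[i])  # remember what was selected
    return sorted(keep.tolist())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = deque(maxlen=64)  # hypothetical buffer size
    batch_logits = [rng.normal(size=(16, 100)) for _ in range(8)]
    batch_embs = [rng.normal(size=32) for _ in range(8)]
    print(select_batch(batch_logits, batch_embs, memory, budget=4))
```

Only forward-pass quantities appear in this sketch, which mirrors the abstract's point that selection needs no reference model, validation set, or extra backpropagation. A bounded `deque` is one simple way to keep the memory of historical samples lightweight.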