Multi-task Supervised Fine-Tuning (SFT) of language models commonly applies a single, uniform compute budget to all sub-datasets. This approach is fundamentally sub-optimal: because learning dynamics differ across tasks, faster-learning tasks overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies the sub-dataset that overfits earliest, excludes it from the mixture, and reverts to the optimal checkpoint for that sub-dataset before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and that it is insensitive to its single new hyperparameter (the compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
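The loop described above (train on the active mixture, find the sub-dataset that overfits earliest, revert to its best checkpoint, exclude it, continue) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `msft_search`, `train_step`, `val_loss`, `budget`, and `max_rounds` are hypothetical, and so is the toy model, in which each task has a fixed optimal number of steps. Two details are assumptions the abstract does not specify: a task counts as overfitting when its validation loss reaches its minimum before the end of the current budget, and a checkpoint is kept after every step.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]  # toy "model": steps of training each task has received


def msft_search(
    init_state: State,
    tasks: List[str],
    train_step: Callable[[State, List[str]], State],
    val_loss: Callable[[State, str], float],
    budget: int,
    max_rounds: int = 100,
) -> Tuple[State, List[str]]:
    """Hypothetical sketch of an overfitting-aware mixture search."""
    state = copy.deepcopy(init_state)
    active = list(tasks)
    excluded: List[str] = []
    for _ in range(max_rounds):
        if not active:
            break
        # Train on the active mixture for `budget` steps, checkpointing each step.
        checkpoints = [copy.deepcopy(state)]
        for _ in range(budget):
            state = train_step(state, active)
            checkpoints.append(copy.deepcopy(state))
        # For each active task, find the checkpoint with the lowest validation loss.
        best = {
            t: min(range(len(checkpoints)), key=lambda i: val_loss(checkpoints[i], t))
            for t in active
        }
        # Assumption: a task has overfit if its best checkpoint precedes the last one.
        overfit = [t for t in active if best[t] < budget]
        if not overfit:
            continue  # nothing overfit yet; keep training from the current state
        earliest = min(overfit, key=lambda t: best[t])
        # Revert to that task's best checkpoint and drop it from the mixture.
        state = copy.deepcopy(checkpoints[best[earliest]])
        active.remove(earliest)
        excluded.append(earliest)
    return state, excluded


# Toy setup (hypothetical): each task's loss is minimized at a fixed step count.
OPTIMUM = {"a": 3, "b": 7, "c": 12}


def toy_train_step(state: State, active: List[str]) -> State:
    # One mixture step advances every active task; excluded tasks are frozen.
    return {t: s + (1 if t in active else 0) for t, s in state.items()}


def toy_val_loss(state: State, task: str) -> float:
    return float((state[task] - OPTIMUM[task]) ** 2)


final_state, order = msft_search(
    {t: 0 for t in OPTIMUM}, list(OPTIMUM), toy_train_step, toy_val_loss, budget=5
)
print(final_state, order)
```

In this toy setting every task ends at its own optimal step count, and the tasks are excluded in order from fastest to slowest learner. The toy model ignores forgetting and cross-task interference, both of which would matter in real fine-tuning.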