How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition

Large language models (LLMs) with enormous numbers of pre-training tokens and parameters exhibit diverse emergent abilities, including math reasoning, code generation, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). While the open-source community has explored ad-hoc SFT for enhancing individual capabilities, proprietary LLMs exhibit versatility across various skills. Therefore, understanding how SFT can facilitate multiple abilities at once is paramount. In this study, we focus on the interplay of data composition between mathematical reasoning, code generation, and general human-alignment abilities during SFT. We pose four research questions to explore the association between model performance and several factors: data amount, composition ratio, model size, and SFT strategy. Our experiments reveal that distinct capabilities scale differently and that larger models generally show superior performance with the same amount of data. Mathematical reasoning and code generation consistently improve with increasing data amount, whereas general abilities plateau after roughly a thousand samples. Moreover, we observe that data composition appears to enhance various abilities under limited data conditions, yet can lead to performance conflicts when data is plentiful. Our findings also suggest that the amount of composition data influences performance more than the composition ratio. In our analysis of SFT strategies, we find that sequentially learning multiple skills risks catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy offers a promising solution for learning multiple abilities with different scaling patterns.
