Large language models (LLMs) trained on enormous numbers of pre-training tokens with vast parameter counts exhibit diverse emergent abilities, including mathematical reasoning, code generation, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). While the open-source community has explored ad-hoc SFT for enhancing individual capabilities, proprietary LLMs exhibit versatility across a range of skills. Therefore, understanding how SFT facilitates multiple abilities at once is paramount. In this study, we focus specifically on the interplay of data composition among mathematical reasoning, code generation, and general human-alignment abilities during SFT. We pose four research questions to explore the association between model performance and various factors, including data amount, composition ratio, model size, and SFT strategy. Our experiments reveal that distinct capabilities scale differently, and that larger models generally show superior performance given the same amount of data. Mathematical reasoning and code generation consistently improve as the data amount increases, whereas general abilities plateau after roughly a thousand samples. Moreover, we observe that data composition appears to enhance all abilities under limited-data conditions, yet can lead to performance conflicts when data is plentiful. Our findings also suggest that the amount of composition data influences performance more than the composition ratio does. In our analysis of SFT strategies, we find that sequentially learning multiple skills risks catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy offers a promising solution for learning multiple abilities with different scaling patterns.