Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost LLM reasoning abilities during pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: how does coding data impact LLMs' reasoning capacities during the IFT stage? To explore this, we thoroughly examine the impact of coding data across different coding data proportions, model families, model sizes, and reasoning domains. Specifically, we create three IFT datasets with increasing proportions of coding data, fine-tune six LLM backbones from different families and scales on these datasets, evaluate the tuned models on twelve tasks spanning three reasoning domains, and analyze the outcomes from three broad-to-granular perspectives: overall, domain-level, and task-specific. Our holistic analysis yields valuable insights at each level. First, tuning with coding data enhances the overall reasoning capabilities of LLMs across model families and scales. Second, while the impact of coding data varies by domain, it shows consistent trends within each domain across model families and scales. Third, coding data generally provides comparable task-specific benefits across model families, with the optimal proportion in the IFT dataset being task-dependent.