Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key, often overlooked, limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform their instruction-tuned variants, with instruction-tuned accuracy trailing by as much as 32.67\% (Llama3-70B). Instruction-tuned models match or exceed base-model performance only when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Additionally, instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure rather than robust reasoning.
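The two evaluation regimes contrasted above can be illustrated with a minimal sketch. The function names and the exemplar below are hypothetical, not the paper's actual harness; the sketch only shows how a zero-shot CoT prompt (a bare reasoning trigger) differs structurally from a few-shot prompt (worked exemplars prepended to the query).

```python
# Hypothetical sketch of the two prompting regimes: zero-shot CoT vs.
# few-shot exemplar prompting for a GSM8K-style question.

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot CoT: append a reasoning trigger, no worked examples.
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    # Few-shot: prepend worked (question, solution) pairs before the query.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

# Illustrative exemplar (invented for this sketch).
exemplars = [(
    "If 2 pens cost $4, how much do 5 pens cost?",
    "Each pen costs $4 / 2 = $2, so 5 pens cost 5 * $2 = $10. The answer is 10.",
)]
question = "A train travels 60 miles in 1.5 hours. What is its speed in mph?"

print(zero_shot_cot_prompt(question))
print("---")
print(few_shot_prompt(exemplars, question))
```

The abstract's finding is that instruction-tuned models need the second format to recover their advantage, while base models perform well under the first.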