Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, across four evaluation paradigms. (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45\% for Roman vs. numeric labels), revealing instruction-format bias. (2) Without instructions, performance drops further (by up to -10.84\%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fall below random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
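To make the label-format manipulation concrete, the following is a minimal Python sketch of how the same multiple-choice item can be rendered under the three label formats while its content stays fixed. The helper names (to_roman, format_question) and the prompt wording are illustrative assumptions, not the authors' released evaluation code.

\begin{verbatim}
# Illustrative sketch of the label-format manipulation (not the paper's code):
# the same MMLU-style options are rendered with alphabetic, numeric, or Roman
# labels, leaving option contents unchanged.

def to_roman(n: int) -> str:
    """Convert a 1-based index to a Roman numeral (sufficient for <= 10 options)."""
    numerals = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]
    return numerals[n - 1]

def format_question(stem: str, options: list[str], style: str = "alphabetic") -> str:
    """Render a multiple-choice prompt with the requested label format."""
    labels = {
        "alphabetic": [chr(ord("A") + i) for i in range(len(options))],
        "numeric": [str(i + 1) for i in range(len(options))],
        "roman": [to_roman(i + 1) for i in range(len(options))],
    }[style]
    lines = [stem] + [f"{lab}. {opt}" for lab, opt in zip(labels, options)]
    # Hypothetical explicit instruction; removing it yields the
    # "without instructions" paradigm described above.
    lines.append(f"Answer with the label of the correct option ({', '.join(labels)}).")
    return "\n".join(lines)

# Example: identical content presented under all three label formats.
q = "Which gas is most abundant in Earth's atmosphere?"
opts = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]
for style in ("alphabetic", "numeric", "roman"):
    print(format_question(q, opts, style), end="\n\n")
\end{verbatim}

Because only the labels differ across renderings, any accuracy gap between formats is attributable to instruction-format sensitivity rather than to the questions themselves.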