Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not reveal whether they faithfully execute the procedure specified in a prompt. We study this question with a controlled diagnostic benchmark for procedural execution: models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses only simple arithmetic operations but scales difficulty through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-corrections after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
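To make the task format concrete, the minimal Python sketch below generates and executes a procedure of this kind: each step applies a simple operation to an earlier intermediate variable (a look-back dependency), and difficulty grows with the step count and look-back distance. The prompt wording, operation set, and variable naming here are illustrative assumptions, not the benchmark's exact specification.

```python
import random

def make_procedure(n_steps, max_lookback=3, seed=0):
    """Sample a step-wise arithmetic procedure.

    Each step i reads an earlier variable v_src (look-back dependency),
    applies a simple operation with a small constant, and writes v_i.
    Operation set and constant range are illustrative assumptions.
    """
    rng = random.Random(seed)
    steps = []
    for i in range(1, n_steps + 1):
        src = max(0, i - rng.randint(1, max_lookback))  # look-back target
        op, k = rng.choice(["+", "-", "*"]), rng.randint(1, 9)
        steps.append((i, src, op, k))
    return steps

def render(steps):
    """Render the procedure as the kind of prompt text a model would see."""
    lines = ["v0 = x + y  (x, y are the two numeric inputs)"]
    for i, src, op, k in steps:
        lines.append(f"Step {i}: v{i} = v{src} {op} {k}")
    lines.append(f"Return v{steps[-1][0]}.")
    return "\n".join(lines)

def execute(steps, x, y):
    """Ground-truth execution: faithfully apply every step in order."""
    v = {0: x + y}
    for i, src, op, k in steps:
        v[i] = v[src] + k if op == "+" else v[src] - k if op == "-" else v[src] * k
    return v[max(v)]

if __name__ == "__main__":
    proc = make_procedure(n_steps=5)
    print(render(proc))
    print("expected answer:", execute(proc, 7, 3))
```

A faithful executor must track every intermediate variable to the end; the failure modes above (premature answers, under-executed traces, hallucinated extra steps) correspond to deviations from this step-by-step trace.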