Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language, as generated by an LM, with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that differ in their semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions, even with minimal, domain-consistent distractors, and the proofs they generate frequently detour through irrelevant inferences.
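To make the metric concrete, the sketch below shows one plausible way such an efficiency score could be computed: given a model-generated proof and the shortest proof obtained by executing the logic program, both normalized to sequences of derived facts, score the fraction of model steps that lie on the shortest proof. This is a minimal illustration under assumed representations; the `Step` class, the `efficiency` function, and the toy facts are hypothetical names, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    """A single inference step, normalized to the fact it derives (hypothetical)."""
    fact: str  # e.g. "total_apples = 12"

def efficiency(model_proof: list[Step], shortest_proof: list[Step]) -> float:
    """Fraction of model inference steps that also occur in the shortest
    proof found by executing the logic program.

    1.0 means every step was necessary; lower values indicate detours
    through irrelevant inferences.
    """
    if not model_proof:
        return 0.0
    necessary = {s.fact for s in shortest_proof}
    on_path = sum(1 for s in model_proof if s.fact in necessary)
    return on_path / len(model_proof)

# Toy usage: a four-step model proof containing one irrelevant detour.
shortest = [Step("a = 3"), Step("b = 4"), Step("a + b = 7")]
model = [Step("a = 3"), Step("c = 9"), Step("b = 4"), Step("a + b = 7")]
print(efficiency(model, shortest))  # 0.75
```

In practice, matching natural-language steps to logic-program facts would require an alignment procedure rather than exact string equality; the set-membership test here stands in for that step.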