To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced numerous mathematical reasoning datasets. However, most existing datasets focus primarily on linear reasoning and neglect other proof paradigms, such as proof by contradiction and proof by cases, which are crucial for probing LLMs' reasoning abilities. To address this limitation, we first introduce PC-FOL, a novel first-order logic (FOL) dataset annotated by professional mathematicians and focused on case-based reasoning problems. Every instance in the dataset is accompanied by a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experiments on leading LLMs reveal a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical models that explains the observed disparity between the two types of reasoning. We hope this work reveals the core challenges in automated natural language mathematical proof generation and paves the way for future research.