To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced numerous mathematical reasoning datasets. However, most existing datasets focus primarily on linear reasoning and neglect other proof paradigms, such as proof by contradiction and proof by cases, which are crucial for probing LLMs' reasoning abilities. To address this limitation, we first introduce PC-FOL, a novel first-order logic (FOL) dataset annotated by professional mathematicians and focused on case-based reasoning problems. Every instance in the dataset is accompanied by a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experiments on leading LLMs reveal a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical models that explains the observed disparity between the two types of reasoning. We hope this work reveals the core challenges in automated natural language mathematical proof generation and paves the way for future research.