Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
翻译:鉴于证明空间规模难处理地庞大,任何具备通用演绎推理能力的模型都必须能泛化到更复杂的证明。近期研究表明,大型语言模型(LLMs)在链式思维提示下具备一定的抽象演绎推理能力。然而,这些研究主要采用肯定前件推理或特定规模的证明进行测试,且测试数据与上下文示例同分布。为衡量LLMs的通用演绎推理能力,我们基于广泛演绎规则集进行测试,并从深度泛化、宽度泛化和组合泛化三个角度评估其从简单示例向复杂证明的泛化能力。为便于系统探索,我们构建了新型可编程合成推理数据集,可灵活控制演绎规则与证明复杂度。在四种不同规模与训练目标的LLM上的实验表明:模型能够泛化到组合型证明,但在泛化到更长证明时存在困难,且需通过显式演示才能生成假设性子证明——特别是反证法与分情况证明中的子证明。