Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to longer and compositional proofs. However, they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
翻译:鉴于证明空间在理论上的巨大规模,任何具备通用演绎推理能力的模型都必须能够泛化到更复杂的证明。近期研究表明,大型语言模型(LLM)在链式思维提示下具备一定的抽象演绎推理能力。然而,现有测试主要针对使用假言推理或特定规模的证明,且测试样本与上下文示例同分布。为衡量LLM的通用演绎推理能力,我们在广泛演绎规则集上开展测试,并从多个维度(深度泛化、宽度泛化与组合泛化)评估其从简单示例向复杂证明的泛化能力。为便于系统性研究,我们构建了新型可编程合成推理数据集,能够灵活控制演绎规则与证明复杂度。对四种不同规模与训练目标LLM的实验表明,模型能够泛化至更长与组合形式的证明。但生成假设性子证明(特别是分情形证明与反证法)时,模型仍需显式示例引导。