Large language models (LLMs) are capable of solving a wide range of tasks, yet they still struggle with reasoning. To address this, we propose $\textbf{Additional Logic Training (ALT)}$, which aims to enhance LLMs' reasoning capabilities through training on program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory with previous empirical insights. Then, based on these principles, we construct a synthetic corpus named $\textbf{Formal Logic Deduction Diverse}$ ($\textbf{FLD}$$^{\times 2}$), comprising numerous samples of multi-step deduction with unknown facts, diverse reasoning rules, diverse linguistic expressions, and challenging distractors. Finally, we empirically show that ALT on FLD$^{\times2}$ substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.