Large language models (LLMs) are capable of solving a wide range of tasks, yet they still struggle with reasoning. To address this, we propose $\textbf{Additional Logic Training (ALT)}$, which aims to enhance LLMs' reasoning capabilities via program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory and prior empirical insights. Then, based on these principles, we construct a synthetic corpus named $\textbf{Formal Logic Deduction Diverse}$ ($\textbf{FLD}$$_{\times 2}$), comprising numerous samples of multi-step deduction with unknown facts, diverse reasoning rules, diverse linguistic expressions, and challenging distractors. Finally, we empirically show that ALT on FLD$_{\times2}$ substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.