Language models often achieve higher accuracy when reasoning step-by-step on complex tasks. However, even when they arrive at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed, such as when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called \emph{guides}, which use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model's choices can change the guide's state. We show how a general system for logical reasoning can be used as a guide, which we call \textsc{LogicGuide}. Given a reasoning problem in natural language, a model can formalize its assumptions for \textsc{LogicGuide} and then guarantee that its step-by-step reasoning is sound. In experiments on the PrOntoQA, ProofWriter and Syllogism Validity datasets, \textsc{LogicGuide} significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains of up to 35\%), while drastically reducing \emph{content effects} -- the interference of unwanted prior assumptions with reasoning, which both humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA from their own reasoning traces. We find that \textsc{LogicGuide} is critical: by training only on certified self-generated reasoning, models can self-improve while avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.
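To make the guide mechanism concrete, the sketch below shows the interaction pattern in Python. This is a minimal illustration under assumed names: \texttt{Guide}, \texttt{valid\_steps}, \texttt{observe}, and \texttt{guided\_generate} are hypothetical and are not the system's actual API. The point is only the loop the abstract describes: the guide exposes a set of valid next statements, the model chooses among them, and that choice updates the guide's state.

\begin{verbatim}
# Minimal sketch of a "guide" tool; all names are illustrative
# assumptions, not the paper's actual API.
from typing import Callable, Set

class Guide:
    """A stateful tool exposing the set of currently valid statements."""
    def __init__(self, assumptions: Set[str]):
        # Guide state: statements established so far.
        self.derived = set(assumptions)

    def valid_steps(self) -> Set[str]:
        """Statements the model may emit next (trivial stub; a real
        guide would compute logically sound next steps)."""
        return {f"therefore {s}" for s in self.derived}

    def observe(self, step: str) -> None:
        """The model's choice feeds back into the guide's state."""
        self.derived.add(step.removeprefix("therefore "))

def guided_generate(model_choose: Callable[[Set[str]], str],
                    guide: Guide, n_steps: int) -> list[str]:
    """Constrain each generation step to the guide's valid set."""
    trace = []
    for _ in range(n_steps):
        options = guide.valid_steps()
        step = model_choose(options)  # model selects a valid statement
        assert step in options        # every emitted step is certified
        guide.observe(step)
        trace.append(step)
    return trace

if __name__ == "__main__":
    guide = Guide({"socrates is a man"})
    # Stub "model": deterministically picks the first valid option.
    print(guided_generate(lambda opts: sorted(opts)[0], guide, 2))
\end{verbatim}

Because every emitted step is checked against the guide's valid set, any trace produced this way is certified by construction, which is what makes such traces safe to use as fine-tuning data for self-improvement.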