Language models often achieve higher accuracy when reasoning step-by-step on complex tasks. However, even when they arrive at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed, such as when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called \emph{guides}, which use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model's choices can change the guide's state. We show how a general system for logical reasoning can be used as a guide, which we call \textsc{LogicGuide}. Given a reasoning problem in natural language, a model can formalize its assumptions for \textsc{LogicGuide} and then guarantee that its step-by-step reasoning is sound. In experiments on the PrOntoQA, ProofWriter and Syllogism Validity datasets, \textsc{LogicGuide} significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains of up to 35\%), while drastically reducing \emph{content effects} -- the interference of unwanted prior assumptions with reasoning, which both humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA from their own reasoning traces. We find that \textsc{LogicGuide} is critical: by training only on certified self-generated reasoning, models can self-improve while avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.
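To make the guide mechanism concrete, the sketch below shows the interaction pattern in Python. This is a minimal illustration under assumed names: \texttt{Guide}, \texttt{valid\_steps}, \texttt{observe}, and \texttt{guided\_generate} are hypothetical and are not the system's actual API. The point is only the loop the abstract describes: the guide exposes a set of valid next statements, the model chooses among them, and that choice updates the guide's state.

\begin{verbatim}
# Minimal sketch of a "guide" tool; all names are illustrative
# assumptions, not the paper's actual API.
from typing import Callable, Set

class Guide:
    """A stateful tool exposing the set of currently valid statements."""
    def __init__(self, assumptions: Set[str]):
        # Guide state: statements established so far.
        self.derived = set(assumptions)

    def valid_steps(self) -> Set[str]:
        """Statements the model may emit next (trivial stub; a real
        guide would compute logically sound next steps)."""
        return {f"therefore {s}" for s in self.derived}

    def observe(self, step: str) -> None:
        """The model's choice feeds back into the guide's state."""
        self.derived.add(step.removeprefix("therefore "))

def guided_generate(model_choose: Callable[[Set[str]], str],
                    guide: Guide, n_steps: int) -> list[str]:
    """Constrain each generation step to the guide's valid set."""
    trace = []
    for _ in range(n_steps):
        options = guide.valid_steps()
        step = model_choose(options)  # model selects a valid statement
        assert step in options        # every emitted step is certified
        guide.observe(step)
        trace.append(step)
    return trace

if __name__ == "__main__":
    guide = Guide({"socrates is a man"})
    # Stub "model": deterministically picks the first valid option.
    print(guided_generate(lambda opts: sorted(opts)[0], guide, 2))
\end{verbatim}

Because every emitted step is checked against the guide's valid set, any trace produced this way is certified by construction, which is what makes such traces safe to use as fine-tuning data for self-improvement.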