Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
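The core idea of Veracity Search — searching over binary veracity assignments for the reasoning steps and scoring each assignment with the LM's joint likelihood as a proxy reward — can be illustrated with a minimal brute-force sketch. This is not the paper's algorithm (which searches the space efficiently rather than exhaustively); the scoring function `joint_loglik` stands in for the LM's joint likelihood over veracity values and the final answer, and the toy scorer below is purely hypothetical.

```python
from itertools import product

def veracity_search(steps, joint_loglik, k=1):
    """Brute-force sketch of Veracity Search.

    Enumerates every binary veracity assignment (1 = step correct,
    0 = step incorrect) over the reasoning steps, scores each with a
    caller-supplied proxy reward, and returns the top-k assignments.
    Exhaustive enumeration is only feasible for short chains; it is
    used here solely to make the search space concrete.
    """
    candidates = product([0, 1], repeat=len(steps))
    scored = sorted(candidates,
                    key=lambda v: joint_loglik(steps, v),
                    reverse=True)
    return scored[:k]

# Hypothetical proxy reward: pretends the LM assigns highest joint
# likelihood to the assignment marking step 1 as incorrect and the
# rest as correct. A real implementation would query the LM instead.
def toy_loglik(steps, v):
    target = [1, 0, 1]
    return -sum(abs(a - b) for a, b in zip(v, target))

steps = ["All cats are mammals.", "All mammals are reptiles.",
         "Therefore, cats are mammals."]
best = veracity_search(steps, toy_loglik)
print(best)  # [(1, 0, 1)] — step 1 flagged as incorrect
```

The top-scoring assignments found this way can then serve as pseudo-labels for supervised fine-tuning of an amortized verifier, mirroring the VS-to-AVI pipeline described above.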