Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks. However, the output of current LLMs is not fully reliable and needs careful verification. Even if LLMs become more accurate over time, learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences. A major challenge in learning verifiers, however, especially when their output will be used by the generator to improve its reasoning, is that the feedback loop between generator and verifier may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric roles of soundness errors (failing to catch errors in a reasoning trace) and completeness errors (flagging correct reasoning steps as wrong), we introduce novel extensions of the Littlestone dimension that tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak generators and enable generation of proofs beyond what they were initially trained on. Under the mild assumption that one of the generators can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.
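To make the soundness/completeness asymmetry concrete, here is a minimal Python sketch of online verifier learning over a finite hypothesis class in the realizable setting. It is not the algorithm from the paper, which works with extensions of the Littlestone dimension; it is the textbook version-space learner with an adjustable acceptance threshold. The names `run_online`, `make_threshold_verifier`, and `accept_threshold`, and the encoding of a trace as a tuple of integers, are assumptions made purely for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Trace = Tuple[int, ...]             # hypothetical encoding of a reasoning trace
Verifier = Callable[[Trace], bool]  # True = accept the trace, False = reject it


def make_threshold_verifier(k: int) -> Verifier:
    """Toy verifier (illustrative only): accept iff every step value is <= k."""
    return lambda trace: max(trace) <= k


def run_online(
    hypotheses: List[Verifier],
    stream: Iterable[Tuple[Trace, bool]],
    accept_threshold: float = 1.0,
) -> Tuple[int, int, int]:
    """Version-space online learner, assuming the true verifier is in `hypotheses`.

    accept_threshold=1.0: accept only if every consistent hypothesis accepts.
        Under realizability this never makes a soundness mistake, and each
        completeness mistake eliminates at least one hypothesis.
    accept_threshold=0.5: majority vote (the halving algorithm), where every
        mistake of either kind removes at least half of the version space.

    Returns (soundness_mistakes, completeness_mistakes, final_version_space_size).
    """
    version_space = list(hypotheses)
    soundness = completeness = 0
    for trace, is_correct in stream:
        votes = sum(h(trace) for h in version_space)
        predict_accept = votes >= accept_threshold * len(version_space)
        if predict_accept and not is_correct:
            soundness += 1      # accepted a faulty trace
        elif not predict_accept and is_correct:
            completeness += 1   # rejected a correct trace
        # Keep only the hypotheses that agree with the revealed label.
        version_space = [h for h in version_space if h(trace) == is_correct]
    return soundness, completeness, len(version_space)


# Toy run: the (hypothetical) ground-truth verifier is the threshold k = 5.
target = make_threshold_verifier(5)
hypotheses = [make_threshold_verifier(k) for k in range(9)]
traces = [(1, 2), (7,), (3, 6), (5,), (4, 5), (6,)]
stream = [(t, target(t)) for t in traces]

conservative = run_online(hypotheses, stream, accept_threshold=1.0)
halving = run_online(hypotheses, stream, accept_threshold=0.5)
print("conservative:", conservative)
print("halving:", halving)
```

Moving `accept_threshold` between 0.5 and 1.0 trades completeness mistakes for soundness mistakes in this toy setting, which is the kind of trade-off the abstract's Pareto frontier describes. The bounds in the docstring are the standard ones for finite classes, not the tight dimension-based bounds the paper claims.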

