Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks. However, the output of current LLMs is not fully reliable and needs careful verification. Even if LLMs become more accurate over time, learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences. A major challenge in learning verifiers, especially when their output will be used by the generator to improve its reasoning, is that the feedback loop between generator and verifier may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric roles of soundness errors (failing to catch errors in a reasoning trace) and completeness errors (flagging correct reasoning steps as wrong), we introduce novel extensions of the Littlestone dimension that tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak generators and enable generation of proofs beyond what they were initially trained on. Under the mild assumption that one of the generators can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.
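To make the asymmetry between soundness and completeness mistakes concrete, the following is a minimal sketch of one extreme point of the trade-off in the realizable online setting. It is not the algorithm proposed in this work. It is a standard conservative version-space learner over a finite hypothesis class, which accepts a trace only when every hypothesis still consistent with past labels accepts it. Under realizability this yields zero soundness mistakes and at most |H| - 1 completeness mistakes. The names `ConservativeVerifierLearner`, `make_threshold`, and `run_demo`, and the toy threshold verifier class, are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trace = Tuple[int, ...]             # a toy "reasoning trace": a tuple of step values
Verifier = Callable[[Trace], bool]  # True means "accept the trace as correct"


@dataclass
class ConservativeVerifierLearner:
    """Accepts a trace only if every consistent hypothesis accepts it."""
    version_space: List[Verifier]
    soundness_mistakes: int = 0     # accepted a trace that was actually wrong
    completeness_mistakes: int = 0  # rejected a trace that was actually correct

    def predict(self, trace: Trace) -> bool:
        return all(h(trace) for h in self.version_space)

    def update(self, trace: Trace, label: bool) -> bool:
        pred = self.predict(trace)
        if pred and not label:
            self.soundness_mistakes += 1
        elif not pred and label:
            self.completeness_mistakes += 1
        # Keep only hypotheses that agree with the revealed label.
        self.version_space = [h for h in self.version_space if h(trace) == label]
        return pred


def make_threshold(t: int) -> Verifier:
    """Hypothetical toy verifier: a trace is correct iff every step is <= t."""
    return lambda trace: all(step <= t for step in trace)


def run_demo() -> ConservativeVerifierLearner:
    hypotheses = [make_threshold(t) for t in range(5)]
    target = make_threshold(3)      # realizable: the target is in the class
    learner = ConservativeVerifierLearner(version_space=list(hypotheses))
    stream = [(0,), (4,), (1, 2), (3,), (2, 4), (3, 3)]
    for trace in stream:
        learner.update(trace, target(trace))
    return learner


if __name__ == "__main__":
    result = run_demo()
    print("soundness mistakes:", result.soundness_mistakes)
    print("completeness mistakes:", result.completeness_mistakes)
```

The learner never accepts an incorrect trace because the target verifier always remains in the version space, so unanimous acceptance implies the target accepts. Each completeness mistake removes at least one hypothesis, which gives the |H| - 1 bound. A learner that instead predicts by majority vote (the halving algorithm) would reduce the total number of mistakes to at most log2 |H| at the cost of allowing soundness mistakes, which is the kind of trade-off the Pareto frontier in the abstract refers to.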