With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at https://github.com/ucr-rai/base-and-edit.
翻译:随着大语言模型(LLM)日益应用于数学推理,Lean等形式化证明辅助工具能够以机器可验证的严谨性验证推理输出,从而支持在测试时缩放中从K个采样候选答案中选择答案等应用场景。然而,使用Lean要求LLM输出(原本为自然语言)首先被形式化。现有基于Lean的答案选择工作使用自动化形式化模型为每个候选答案独立生成Lean形式化语句,这使得计算成本高昂。我们提出BASE——一种基于基候选答案和编辑的流水线:每个问题仅形式化一个基候选答案,并通过原地编辑答案表达式推导出其余K-1条语句。为此,我们训练重写器模型LEANSCRIBE,用于定位基形式化中的答案,并为其他K-1个候选答案生成可复用的编辑函数。BASE同时提升了选择准确率和降低了形式化成本——这一帕累托改进在三个基准测试和四个求解器构成的全部12个(数据集,求解器)配置中均成立,在K=8时减少约5倍的自动化形式化器调用次数,且随着K增大预期减少幅度会更大。代码开源在https://github.com/ucr-rai/base-and-edit。