Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models (LLMs) can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises because existing LLM approaches treat autoformalization as a simplistic translation task, lacking the mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which applies different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validation, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
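The iterative generate–assess–refine loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `critique`, and `refine` are hypothetical stand-ins for model calls, here stubbed with deterministic behavior so the control flow is runnable.

```python
# Minimal sketch of a reflective autoformalization loop (illustrative only;
# the stub functions stand in for actual LLM calls).

def generate(problem: str) -> str:
    """Produce an initial formal statement (stub: deliberately flawed draft)."""
    return f"theorem draft : {problem} := sorry  -- missing hypothesis"

def critique(problem: str, statement: str) -> list[str]:
    """Return semantic-consistency issues; an empty list means consistent (stub)."""
    return ["missing hypothesis"] if "missing hypothesis" in statement else []

def refine(statement: str, issues: list[str]) -> str:
    """Revise the statement to address the identified issues (stub)."""
    return statement.replace("  -- missing hypothesis", "")

def reflective_autoformalize(problem: str, max_rounds: int = 4) -> str:
    """Generate, then alternate self-assessment and correction until the
    statement is judged semantically faithful or the round budget is spent."""
    statement = generate(problem)
    for _ in range(max_rounds):
        issues = critique(problem, statement)
        if not issues:  # self-assessed as semantically consistent
            break
        statement = refine(statement, issues)
    return statement
```

The key design point motivating PBSO is visible here: if `critique` rubber-stamps every draft (a superficial critique), the loop exits immediately and refinement never happens, which is why training must reward critique accuracy and formalization accuracy separately.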