The demand for synthetic data in mathematical reasoning has grown due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, limiting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisations to mitigate errors, raising the execution rate on the Lean prover from 60% to 87%. Building on this, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmark performance with only 3,508 samples, achieving a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
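To make the refinement loop concrete, the sketch below illustrates iterative autoformalisation in Python. It is a minimal illustration under stated assumptions: `llm_complete` is a hypothetical helper wrapping an LLM API, the prompts and iteration budget are invented for illustration, and only the `lean` command-line invocation reflects a real tool; none of this is the paper's exact pipeline.

```python
import subprocess
import tempfile

MAX_ROUNDS = 5  # assumed refinement budget, not the paper's setting

def run_lean(lean_source: str) -> tuple[bool, str]:
    """Compile a candidate formalisation with the Lean CLI and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = f.name
    proc = subprocess.run(["lean", path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def iterative_autoformalise(informal_step: str) -> str | None:
    """Formalise one natural-language reasoning step, refining on prover errors."""
    prompt = f"Formalise the following reasoning step in Lean:\n{informal_step}"
    candidate = llm_complete(prompt)  # hypothetical LLM call, assumed available
    for _ in range(MAX_ROUNDS):
        ok, log = run_lean(candidate)
        if ok:
            return candidate  # the prover accepted the formalisation
        # Feed the prover's error messages back to the model for another round.
        candidate = llm_complete(
            f"{prompt}\nPrevious attempt:\n{candidate}\n"
            f"Lean errors:\n{log}\nRevise the formalisation to fix these errors."
        )
    return None  # still failing after the budget; discard this sample
```

Under this reading, a step that executes within the budget is kept as verified synthetic data (the TP-as-a-Judge signal), while a persistently failing step is discarded; the same binary outcome can stand in for the human preference label in the RLTPF setup.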