Recent large language models (LLMs) have achieved significant advances on various tasks, including mathematical reasoning and theorem proving. Because these two tasks require strict, formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs, yet they still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate-step guidance. However, such step-wise annotation requires heavy labor, leaving current benchmarks with insufficient step-wise training data. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem-and-proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) it samples a few mathematical concept seeds as the problem category; (2) it then prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions; (3) lastly, the framework uses a proof assistant (e.g., the Lean Prover) to filter out invalid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark, MUSTARDSAUCE, with 5,866 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover's validation. We perform extensive analysis and demonstrate that MUSTARD generates validated, high-quality step-by-step data. We further apply MUSTARDSAUCE to fine-tune smaller language models. The fine-tuned Llama 2-7B achieves a 15.41% average relative performance gain in automated theorem proving and an 8.18% gain on math word problems. Code and data are available at https://github.com/Eleanor-H/MUSTARD.
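The three-stage pipeline described above can be sketched as follows. This is a minimal illustration only: the concept pool, `generate_problem_and_proof`, and `lean_verifies` are hypothetical stand-ins for the generative language model and the Lean prover that the actual framework invokes.

```python
import random

# Hypothetical pool of mathematical concept seeds (stage 1 input).
CONCEPT_POOL = ["prime numbers", "modular arithmetic", "induction", "inequalities"]

def sample_concepts(k=2, rng=random):
    # Stage 1: sample a few concept seeds as the problem category.
    return rng.sample(CONCEPT_POOL, k)

def generate_problem_and_proof(concepts):
    # Stage 2: in the real framework, a generative LLM is prompted with the
    # sampled concepts to produce a problem plus step-wise solutions.
    # Stubbed here with a fixed template.
    statement = "A problem about " + " and ".join(concepts)
    return {
        "informal_statement": statement,
        "informal_proof": "step-by-step natural-language argument ...",
        "formal_proof": "theorem t : True := trivial",  # placeholder Lean proof
    }

def lean_verifies(formal_proof):
    # Stage 3: a real implementation would run the Lean prover on the proof;
    # here we merely accept any nonempty proof string.
    return bool(formal_proof.strip())

def mustard_pipeline(n_samples=3):
    # Chain the three stages, keeping only candidates whose formal
    # proof passes (stub) validation.
    data = []
    for _ in range(n_samples):
        concepts = sample_concepts()
        candidate = generate_problem_and_proof(concepts)
        if lean_verifies(candidate["formal_proof"]):
            data.append(candidate)
    return data

dataset = mustard_pipeline()
```

Each surviving entry mirrors a MUSTARDSAUCE data point: an informal statement, an informal proof, and a formal proof that passed validation.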