MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data

Recent large language models (LLMs) have witnessed significant advancement in various tasks, including mathematical reasoning and theorem proving. As these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs but still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate steps guidance. However, such step-wise annotation requires heavy labor, leading to insufficient training steps for current benchmarks. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) It samples a few mathematical concept seeds as the problem category. (2) Then, it prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions. (3) Lastly, the framework utilizes a proof assistant (e.g., Lean Prover) to filter the valid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover validation. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data. We further apply the MUSTARDSAUCE for fine-tuning smaller language models. The fine-tuned Llama 2-7B achieves a 15.41% average relative performance gain in automated theorem proving, and 8.18% in math word problems. Codes and data are available at https://github.com/Eleanor-H/MUSTARD.

翻译：近期，大规模语言模型在包括数学推理和定理证明在内的多种任务上取得了显著进展。这两类任务要求严格且形式化的多步骤推理，因此成为探究语言模型推理能力极具吸引力的领域，但仍面临重要挑战。已有研究如思维链（Chain-of-Thought, CoT）揭示了中间步骤引导的有效性，然而此类逐步骤标注需要大量人工成本，导致当前基准数据集的训练步骤不足。为填补这一空白，本文提出MUSTARD——一种能够统一合成高质量、高多样性定理与证明数据的数据生成框架。MUSTARD通过三个阶段合成数据：（1）采样若干数学概念种子作为问题类别；（2）利用采样概念提示生成式语言模型，同时获取问题及其逐步骤形式化解法；（3）最后，框架借助证明助手（如Lean Prover）过滤有效证明。通过所提出的MUSTARD，我们构建了一个包含5,866个有效数据点的定理与证明基准数据集MUSTARDSAUCE。每个数据点包含非形式化陈述、非形式化证明以及通过证明验证器验证的形式化翻译证明。我们进行了广泛分析，表明MUSTARD能够生成经过验证的高质量逐步骤数据。进一步地将MUSTARDSAUCE用于微调小型语言模型，微调后的Llama 2-7B在自动定理证明任务中平均相对性能提升15.41%，在数学文字题任务中提升8.18%。代码与数据已开源至https://github.com/Eleanor-H/MUSTARD。