Recent large language models (LLMs) have achieved significant advances on various tasks, including mathematical reasoning and theorem proving. Because these two tasks require strict, formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs, yet they still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate-step guidance. However, such step-wise annotation requires heavy labor, leaving current benchmarks with insufficient step-wise training data. To fill this gap, this work introduces MUSTARD, a data generation framework that masters uniform synthesis of theorem-and-proof data of high quality and diversity. MUSTARD synthesizes data in three stages: (1) it samples a few mathematical concept seeds as the problem category; (2) it then prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions; (3) lastly, the framework uses a proof assistant (e.g., the Lean Prover) to filter out invalid proofs. With the proposed MUSTARD, we present a theorem-and-proof benchmark, MUSTARDSAUCE, with 5,866 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover's validation. We perform extensive analysis and demonstrate that MUSTARD generates validated, high-quality step-by-step data. We further apply MUSTARDSAUCE to fine-tune smaller language models. The fine-tuned Llama 2-7B achieves a 15.41% average relative performance gain in automated theorem proving and an 8.18% gain on math word problems. Code and data are available at https://github.com/Eleanor-H/MUSTARD.
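The three-stage pipeline described above can be sketched as follows. This is a minimal illustration only: the concept pool, `generate_problem_and_proof`, and `lean_verifies` are hypothetical stand-ins for the generative language model and the Lean prover that the actual framework invokes.

```python
import random

# Hypothetical pool of mathematical concept seeds (stage 1 input).
CONCEPT_POOL = ["prime numbers", "modular arithmetic", "induction", "inequalities"]

def sample_concepts(k=2, rng=random):
    # Stage 1: sample a few concept seeds as the problem category.
    return rng.sample(CONCEPT_POOL, k)

def generate_problem_and_proof(concepts):
    # Stage 2: in the real framework, a generative LLM is prompted with the
    # sampled concepts to produce a problem plus step-wise solutions.
    # Stubbed here with a fixed template.
    statement = "A problem about " + " and ".join(concepts)
    return {
        "informal_statement": statement,
        "informal_proof": "step-by-step natural-language argument ...",
        "formal_proof": "theorem t : True := trivial",  # placeholder Lean proof
    }

def lean_verifies(formal_proof):
    # Stage 3: a real implementation would run the Lean prover on the proof;
    # here we merely accept any nonempty proof string.
    return bool(formal_proof.strip())

def mustard_pipeline(n_samples=3):
    # Chain the three stages, keeping only candidates whose formal
    # proof passes (stub) validation.
    data = []
    for _ in range(n_samples):
        concepts = sample_concepts()
        candidate = generate_problem_and_proof(concepts)
        if lean_verifies(candidate["formal_proof"]):
            data.append(candidate)
    return data

dataset = mustard_pipeline()
```

Each surviving entry mirrors a MUSTARDSAUCE data point: an informal statement, an informal proof, and a formal proof that passed validation.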