Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving is to generate complete proofs with Large Language Models (LLMs) from Natural Language (NL) proofs; similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance on this task due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity leaves few established methodologies for training LLMs as formal provers and few techniques for fully exploiting their capabilities in composing formal proofs. To address these challenges, this paper proposes **TheoremLlama**, an end-to-end framework that trains a general-purpose LLM to become a Lean4 expert. The framework encompasses NL-FL aligned dataset generation, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation of the framework is the NL-FL bootstrapping method, which integrates NL proofs into Lean4 code in the training data, leveraging the NL reasoning ability of LLMs for formal reasoning. **TheoremLlama** achieves cumulative accuracies of 36.48% and 33.61% on the MiniF2F-Valid and Test datasets, respectively, surpassing the GPT-4 baselines of 22.95% and 25.41%. We have open-sourced our model checkpoints and the generated dataset, and will soon make all the code publicly available.
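To make the NL-FL bootstrapping idea concrete, the sketch below shows the general shape of such an aligned training example: natural-language proof steps interleaved as comments with the Lean4 proof they correspond to. This is an illustrative toy theorem written for this description, not an actual entry from the OBT dataset.

```lean
-- Illustrative NL-FL bootstrapped example (hypothetical, not from OBT):
-- the informal proof is embedded as comments alongside the formal steps.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- NL: Addition on the natural numbers is commutative, so swapping
  -- the two summands leaves the value unchanged. Formally, this is
  -- exactly the standard library lemma `Nat.add_comm`.
  exact Nat.add_comm a b
```

Training on examples of this form lets the model exploit its strong natural-language reasoning when producing the formal tactic steps, which is the mechanism the bootstrapping method relies on.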