LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on because of private code, private data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables programmatic interaction with the proof environment. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection, a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library. It is inexpensive and requires only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 96,962 theorems and proofs extracted from Lean's math library. It features a challenging data split that requires the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.
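To illustrate how restricting retrieval to accessible premises and sampling hard negatives might fit together, the following is a minimal sketch and not ReProver's actual implementation. It assumes premises and proof states are already embedded as vectors, that "accessible" means the set of premise names visible at the current theorem, and that a hard negative is an accessible premise the proof step did not use. All function names and the toy data are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(
    state_vec: Vector,
    premise_vecs: Dict[str, Vector],
    accessible: Set[str],
    k: int,
) -> List[str]:
    """Rank only the accessible premises by similarity to the proof state."""
    candidates = [name for name in premise_vecs if name in accessible]
    ranked = sorted(
        candidates,
        key=lambda name: cosine(state_vec, premise_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def sample_hard_negatives(
    accessible: Set[str],
    positives: Set[str],
    n: int,
    rng: random.Random,
) -> List[str]:
    """Assumption: negatives are accessible premises not used in the proof step."""
    pool = sorted(accessible - positives)  # sorted for reproducible sampling
    return rng.sample(pool, min(n, len(pool)))


# Toy data (hypothetical embeddings, not produced by any real model).
premise_vecs = {
    "add_comm": (1.0, 0.0),
    "add_assoc": (0.9, 0.1),
    "mul_comm": (0.0, 1.0),
    "later_lemma": (1.0, 0.0),  # defined after the theorem, so not accessible
}
accessible = {"add_comm", "add_assoc", "mul_comm"}

top = retrieve((1.0, 0.0), premise_vecs, accessible, k=2)
negatives = sample_hard_negatives(accessible, {"add_comm"}, 2, random.Random(0))
```

In this toy setup, `later_lemma` has the same embedding as `add_comm` but is never returned because it is filtered out before ranking, which is the effect the accessibility restriction is meant to have.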
