Recently, large language models have shown promising results in aiding formal mathematical reasoning. However, their performance is limited by the scarcity of formal theorem-proving data, which must be extracted from raw formal language corpora with additional effort. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieves accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing the previous state-of-the-art method (52%). It also achieves state-of-the-art results on two other Lean 4 benchmarks, ProofNet and Putnam, which target different fields and levels of mathematics. These results demonstrate that our proposed dataset is beneficial for formal reasoning across a wide range of math topics. We open-source our model at https://GitHub.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-GitHub