Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.
翻译:大型语言模型在各种自然语言处理任务中展现出令人印象深刻的能力,尤其在解决数学问题方面。然而,大型语言模型在使用Lean等正式语言进行数学定理证明方面表现欠佳。该领域面临的一个重大挑战是可用的正式语言训练数据匮乏。为解决这一问题,我们提出了一种新颖的流水线方法,通过迭代生成并过滤合成数据,将自然语言数学问题翻译为Lean 4语句,反之亦然。实验结果表明,该合成数据流水线能够提供有效的训练数据,并提升大型语言模型在翻译和理解复杂数学问题及证明方面的性能。最终数据集包含约57K条正式-非正式问题对,以及从数学竞赛论坛搜索得到的证明和21个新的IMO问题。我们在https://github.com/InternLM/InternLM-Math开源代码,并在https://huggingface.co/datasets/InternLM/Lean-Workbook 公开数据。