Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.
翻译:大型语言模型(LLMs)具有革新自动形式化领域的潜力。Lean4作为一种数学编程语言的引入,为严格评估LLMs的自动形式化能力提供了前所未有的机遇。本文提出了一种专为Lean4设计的新型评估基准,并应用该基准测试了包括GPT-3.5、GPT-4和Gemini Pro在内的前沿LLMs。我们的综合分析表明,尽管近期取得了进展,这些LLMs在自动形式化方面仍存在局限性,尤其在更复杂的数学领域。这些发现凸显了需要进一步开发LLMs以充分发挥其在科学研究与开发中的潜力。本研究不仅为当前LLM能力建立了基准,更为未来自动形式化技术的改进奠定了基础。