Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved. In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F and Lean 4 version of ProofNet, and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
翻译:大型语言模型(LLMs)近期已成为自动形式化的强大工具。尽管性能卓越,这些模型在生成有依据且可验证的形式化内容时仍面临挑战。文本到SQL领域的近期研究表明,即便语义忠实度保持较高水平,LLMs对改写的自然语言输入仍可能敏感。本文在自动形式化领域对此论断展开研究。具体而言,我们通过测量语义有效性与编译有效性,评估LLMs在语义相似的改写自然语言陈述中生成形式化证明的鲁棒性。利用形式化基准MiniF2F与Lean 4版ProofNet,以及两个现代LLMs,我们生成改写后的自然语言陈述,并在两个模型间进行交叉评估。本文结果显示,改写输入的性能存在差异,这表明自然语言陈述的细微变化会显著影响模型输出。