Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when a high degree of semantic fidelity is preserved. In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs that generate formal proofs when they are given semantically similar paraphrases of NL statements, measuring both the semantic validity and the compilation validity of their outputs. Using the formal benchmarks MiniF2F and the Lean 4 version of ProofNet, together with two modern LLMs, we generate paraphrased NL statements and cross-evaluate them across both models. Our results reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.
