Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

翻译：人工智能编程智能体正越来越多地被用于编写真实世界的软件，但确保其输出的正确性仍是一个根本性挑战。形式化验证提供了一条有前景的路径：智能体生成代码的同时附带机器验证的证明，从而保证代码满足形式化规范。然而，形式化规范本身是否与用户意图一致并无保障。本文研究规范自动形式化问题：即LLM智能体能否将非形式化编程问题转化为忠实的形式化规范。我们引入Verus-SpecBench基准，包含581个源自Codeforces问题的规范编写任务（针对Rust轻量级验证器Verus），以及Verus-SpecGym智能体环境，模型可在其中与Verus、bash及文件系统交互来开发这些规范。核心挑战在于评估：专家编写的参考规范成本高昂，且LLM评判器可能遗漏细微错误。我们通过以下方法解决这一问题：(a) 扩展Verus的exec_spec机制，使生成的规范可作为Rust代码执行；(b) 将规范与官方Codeforces测试及从Codeforces“hacks”（参赛者为攻击错误解决方案而编写的边缘案例）中提取的对抗性用例进行比对。在Verus-SpecBench上，最强模型Gemini 3.1 Pro解决了77.8%的任务，其他前沿模型解决了51.1%-57.8%，而开源模型仅达到21.5%-25.5%。我们对失败模式的分析表明，模型生成的规范可能遗漏重要输入假设、接受错误输出或拒绝正确输出。我们还发现，LLM作为评判器的评估方式会遗漏本评估器所捕获的26%的失败案例。总体而言，我们的结果表明，规范自动形式化对前沿智能体而言已触手可及，但即使在这些智能体已能生成正确代码的问题上，该过程仍显脆弱。代码、数据及日志可在https://github.com/formal-verif-is-cool/verus-spec-gym获取。