Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.
翻译:形式化验证是确保大型语言模型(LLMs)生成代码正确性的下一个前沿方向。尽管通过协同生成代码与形式规范(如Dafny中的形式语言)的方法在原则上能够证明与用户意图的一致性,但其进展受限于规范质量评估的瓶颈。当前基准测试依赖与真值规范进行匹配——这一需专家手动完成且专业门槛极高的过程,导致现有数据集仅局限于数百个简单问题,同时存在可靠性缺陷。为此,我们提出VeriEquivBench——一个包含2389个复杂算法问题的新基准,可探查当前模型在代码生成与形式推理两方面的局限性。本评估框架采用基于形式化验证的等价性分数替代真值匹配,并严谨验证生成规范与代码的质量。结果表明,生成形式可验证代码对最先进的LLMs而言仍是巨大挑战。这既凸显了该任务的难度,也表明需要像VeriEquivBench这样的基准来驱动可扩展且可靠代码智能体研究的发展。