Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.
翻译:形式验证是确保大语言模型生成代码正确性的下一个前沿方向。尽管在Dafny等形式语言中协同生成代码与形式规约的方法在理论上能证明其与用户意图的对齐,但该进展受到规约质量评估的瓶颈限制。当前基准测试依赖人工且需要专业知识的真值规约匹配过程,这不仅将现有数据集限制在数百个简单问题上,还存在可靠性问题。为此,我们提出VeriEquivBench——包含2389个复杂算法问题的新型基准测试,可探测当前模型在代码生成与形式推理两方面的局限性。我们的评估框架采用形式化度量的等价性分数替代真值匹配,并严格验证生成规约与代码的质量。结果表明,生成形式可验证代码对最先进的大语言模型仍是重大挑战。这既凸显了该任务的难度,也彰显了像VeriEquivBench这样的基准测试对于推动可扩展且可靠的编码智能体研发的必要性。