VERINA: Benchmarking Verifiable Code Generation

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o3, achieves a 72.6\% code correctness rate, 52.3\% for specification soundness and completeness, and a mere 4.9\% proof success rate (based on one trial per task). We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

翻译：大型语言模型（LLMs）正日益融入软件开发流程，但确保LLM生成代码的正确性仍具挑战性且常需耗费高昂的人工审查成本。可验证代码生成——即联合生成代码、规范及代码与规范一致性的证明——为解决这一局限并进一步释放LLMs在编程领域的潜力提供了可行路径。然而，当前评估体系存在显著空白：现有基准测试往往仅关注单一组件，缺乏对所有任务进行整体评估的框架。本文提出VERINA（可验证代码生成竞技场），这是一个支持对代码生成、规范生成、证明生成及其组合任务进行全面模块化评估的高质量基准。VERINA包含189项基于Lean语言手工构建的编程任务，每项任务均配备详细的问题描述、参考实现、形式化规范及扩展测试套件。通过对前沿LLMs的广泛评估，我们揭示了可验证代码生成领域存在的重大挑战，尤其在证明生成方面，这凸显了在验证领域改进基于LLM的定理证明器的迫切需求。表现最佳的模型OpenAI o3在代码正确率上达到72.6%，规范健全性与完备性为52.3%，而证明成功率仅为4.9%（基于每项任务单次尝试）。我们期望VERINA能通过提供严谨全面的基准测试，推动可验证代码生成领域的发展。数据集发布于https://huggingface.co/datasets/sunblaze-ucb/verina，评估代码发布于https://github.com/sunblaze-ucb/verina。