Large language models (LLMs) are increasingly integrated into software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and to further unlock the benefits of LLMs in coding. Yet a significant evaluation gap remains: current benchmarks often target individual components rather than evaluating all of these tasks holistically. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation, as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, each with a detailed problem description, a reference implementation, formal specifications, and an extensive test suite. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improved LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4\% code correctness rate, 51.0\% specification soundness and completeness, and a mere 3.6\% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset at https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code at https://github.com/sunblaze-ucb/verina.
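To make the code-specification-proof triple concrete, the following is a minimal Lean 4 sketch of what such a task might look like. This is a hypothetical illustration, not an actual Verina task; the names `myAbs`, `myAbs_spec`, and `myAbs_correct` are invented for this example, and the proof assumes the `omega` tactic is available (as in recent Lean toolchains).

```lean
-- Hypothetical example of a verifiable code generation triple
-- (illustrative only; not taken from the Verina benchmark).

-- Code: an implementation of integer absolute value.
def myAbs (x : Int) : Int :=
  if x < 0 then -x else x

-- Specification: the result is nonnegative and equals x or -x.
def myAbs_spec (x r : Int) : Prop :=
  r ≥ 0 ∧ (r = x ∨ r = -x)

-- Proof of code-specification alignment: the implementation
-- satisfies the specification for every input.
theorem myAbs_correct (x : Int) : myAbs_spec x (myAbs x) := by
  unfold myAbs_spec myAbs
  split <;> omega
```

A benchmark task can then evaluate each component in isolation (e.g., generate only the proof given the code and specification) or jointly, which is the modular evaluation the abstract describes.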