We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world's code.
翻译:我们提出了一个用于评估AI模型和智能体在现实世界形式化软件验证任务上的基准。首先从真实世界的Python代码仓库中爬取11,039个基于属性的测试(PBT),随后自动将其中2,772个(25%)转化为9,415条包含TODO占位符的Lean 4规范(平均每个PBT对应约3个形式化表述;当质量指标无法判定最优时保留多种尝试)。将PBT转化为Lean规范面临多重挑战:需要在Lean中建模Python语义、推断命令式PBT中编码的逻辑属性,并处理在少用语言中进行依赖类型编程的固有困难。我们描述了一个三智能体LLM流水线用于将PBT转译为Lean规范,评估了覆盖率和质量指标,并基于多种自动化和模型方法提供了证明生成的基线。所有代码(爬虫与智能体)和数据(PBT与Lean规范)均已开源。本基准旨在推动AI辅助现实世界软件形式化验证这一尚未充分探索的问题的发展——随着AI生成代码量日益增长,该议题正变得愈发重要。