Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.