Automating Formal Verification with Agent-Guided Tree Search

Formal verification offers a path to provably correct software, but writing verified code remains expensive enough that the technique is rarely used in production. Recent large language models can accelerate this work, and recent benchmarks measure their ability to translate specifications into code and machine-checked proofs of correctness. This thesis evaluates the state of such LLM-driven verified-code generation ("vericoding") in Lean and develops search-based methods for improving verification performance. We first reproduce a subset of the vericoding-benchmark Lean leaderboard on a current cross-vendor model pool, finding that non-reasoning performance remains roughly steady on US closed-source models while open-weight models have slightly improved. We update the iterative methodology of vericoding-benchmark with an agentic loop equipped with mathlib search, finding that model performance greatly improves and scales with agent budget. GPT-5.4 nearly saturates the benchmark at 95.0% on 423 specs with $K=50$ LLM calls. We then design two agent-directed tree-search formulations: a state-based orchestrator that branches on partial-proof states, and a context-based orchestrator that branches on full subagent contexts. Compared against the agent baseline, the context-based design solves a wider range of intermediate-difficulty specs at lower token cost, while the agent baseline retains an advantage on the hardest specs, where uninterrupted iteration matters most. We conclude that search structure has selective advantages over a strong agent baseline, and that more challenging benchmarks drawn from modern code are important to measure and drive further progress in automated formal verification. Code available upon request by contacting the author at [email protected].

翻译：形式化验证为证明软件正确性提供了途径，但编写已验证代码的成本依然高昂，以至于该技术在生产环境中鲜有使用。近期的大语言模型可加速这一工作，而新近的基准评测则用于衡量其将规约转化为代码及机器检验的正确性证明能力。本论文评估了此类大语言模型驱动的已验证代码生成（简称为"vericoding"）在Lean语言中的现状，并发展了基于搜索的方法以提升验证性能。我们首先在当前的跨供应商模型池上复现了vericoding基准Lea排行榜的子集，发现非推理性能在闭源美国模型中大致保持稳定，而开源权重模型则略有提升。我们通过配备mathlib搜索的智能体循环更新了vericoding基准的迭代方法论，发现模型性能显著提升且与智能体预算成正比。GPT-5.4在423个规约上以K=50次大语言模型调用达到了95.0%的接近饱和基准性能。我们继而设计了两种面向智能体的树搜索形式：基于状态的编排器对部分证明状态进行分支，基于上下文的编排器则对完整子智能体上下文中进行分支。相较于智能体基线，基于上下文的设计以更低的令牌成本解决了更广泛的中间难度规约，而智能体基线在最难的规约上仍保持优势——在此类场景中不间断迭代最为关键。我们得出结论：搜索结构相较于强智能体基线具有选择性优势，且源自现代代码的高难度基准对于衡量和推动自动化形式化验证的进一步进展至关重要。代码获取请通过联系作者[email protected]。