Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
翻译:验证编码指从严格规范生成形式验证代码。近期人工智能模型在验证编码方面展现出潜力,但缺乏跨范式评估的统一方法。现有基准仅测试单一语言/工具(如Dafny、Verus和Lean),且各自覆盖差异显著的任务,导致性能数据无法直接比较。我们通过AlgoVeri基准填补这一空白,该基准在Dafny、Verus和Lean三种工具中对$77$个经典算法进行验证编码评估。通过强制执行统一功能契约,AlgoVeri揭示了验证系统中的关键能力缺口:前沿模型在Dafny中取得可行成功率(Gemini-3 Flash达$40.3$%),其高层抽象和SMT自动化简化了工作流;但在Verus的系统级内存约束下性能骤降($24.7$%),在Lean所需的显式证明构造中表现更差(7.8%)。除总体指标外,我们发现测试时计算动态存在显著差异:Gemini-3能有效利用迭代修复提升性能(如在Dafny中通过率提升三倍),而GPT-OSS则过早达到饱和。误差分析表明语言设计影响细化轨迹:Dafny允许模型聚焦逻辑正确性,而Verus和Lean使模型陷入持久的语法与语义障碍。所有数据与评估代码可在https://github.com/haoyuzhao123/algoveri获取。