Certified Program Synthesis with a Multi-Modal Verifier

Certified program synthesis (aka vericoding) is the process of automatically generating a program, its formal specification, and a machine-checkable proof of their alignment from a natural-language description. Two challenges make vericoding difficult. First, specifications synthesised from natural language are often either too weak to be meaningful or too strong to be implementable, yet existing approaches lack systematic means to detect such defects. Second, the landscape of program verifiers is fragmented: each tool supports a particular reasoning mode -- auto-active (e.g., Dafny, Verus) or interactive (e.g., Coq, Lean) -- with its own trade-off between automation and expressivity. This forces every synthesis methodology to be tailored to a single verification paradigm, limiting the class of tasks it can handle effectively. We overcome both challenges by structuring the certified synthesis workflow around a multi-modal verifier -- a single tool combining dynamic validation, automated proofs, and interactive proof scripting in one foundational framework. We realise this idea in LeetProof, an agentic pipeline built on Velvet, a multi-modal verifier embedded in Lean. Multi-modality enables LeetProof to validate generated specifications via randomised property-based testing before any code is synthesised, decompose the synthesis task into sub-problems guided by verification conditions, and delegate residual proof obligations to frontier AI provers specialised for Lean. We evaluate LeetProof on benchmarks derived from prior work on certified synthesis. Our specification validation uncovers defects in existing reference benchmarks, and LeetProof's staged pipeline achieves a significantly higher rate of fully certified solutions than a single-mode baseline at the same budget -- consistently across two frontier LLM backends.

翻译：认证程序合成（亦称验码）是指从自然语言描述中自动生成程序、其形式规约以及两者一致性的机器可验证证明的过程。两项挑战使验码变得困难：首先，从自然语言合成的规约往往要么过弱而缺乏意义，要么过强而难以实现，但现有方法缺乏系统性手段检测此类缺陷；其次，程序验证工具格局碎片化：每种工具支持特定的推理模式——自动交互式（如Dafny、Verus）或交互式（如Coq、Lean）——在自动化与表达力之间各有取舍。这迫使每种合成方法论必须针对单一验证范式定制，限制了其能有效处理的任务类别。我们通过将认证合成工作流围绕多模态验证器——一种在单一基础框架中结合动态验证、自动化证明与交互式证明脚本的工具——结构化组织来克服这两项挑战。我们在LeetProof中实现了这一理念，这是一个基于嵌入Lean的多模态验证器Velvet构建的智能体流水线。多模态性使LeetProof能在合成任何代码前通过随机化基于属性的测试验证生成的规约，将合成任务分解为以验证条件为指导的子问题，并将剩余证明义务委托给专精于Lean的前沿AI证明器。我们在源自先前认证合成工作的基准测试上评估LeetProof。我们的规约验证揭示了现有参考基准中的缺陷，且LeetProof的分阶段流水线在相同预算下实现了比单模态基线显著更高的完全认证解决方案率——此结果在两个前沿LLM后端上保持一致。