BRIDGE: Building Representations In Domain Guided Program Synthesis

from arxiv, Approx. 23 pages including appendices, 10 figures, 3 tables. Empirical study of LLM-based verified program synthesis in Lean4 (code, specs, and proofs)

Large language models (LLMs) are good at generating code, but remain brittle for formal verification in systems like Lean4. A core scalability challenge is that verified synthesis requires consistent outputs across multiple artifacts: executable code, precise specifications, theorem statements, and ultimately proofs. Existing approaches rarely treat these as a unified pipeline. We present BRIDGE, a structured prompting framework that decomposes verification into three interconnected domains: Code (implementations), Specifications (formal intent), and Theorem Statements (constructive correctness claims), and elicits domain-specific intermediate reasoning to connect them. In Lean4, BRIDGE often adopts a code-first workflow, using the generated implementation as a semantic anchor for downstream specification and theorem statement generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by nearly 1.5x (pass at 5) over direct baselines and can be 2x more sample-efficient at inference time, requiring fewer samples per verified solution at comparable generation lengths. We further find that specification-driven prompting improves Python pass rates by up to 17.5 percent. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only SFT, indicating that these intermediate representations are learnable. BRIDGE provides a practical foundation for scaling verified synthesis and motivates future work on expert iteration and full proof generation.

翻译：大型语言模型（LLM）擅长生成代码，但在Lean4等系统中进行形式化验证时仍显脆弱。一个核心的可扩展性挑战在于，经过验证的综合过程需要跨多个构件保持输出的一致性：可执行代码、精确规约、定理陈述以及最终的证明。现有方法很少将这些构件视为统一流程进行处理。本文提出BRIDGE——一种结构化提示框架，它将验证过程分解为三个相互关联的领域：代码（实现）、规约（形式化意图）和定理陈述（构造性正确性声明），并通过激发领域特定的中间推理来建立它们之间的连接。在Lean4环境中，BRIDGE通常采用代码优先的工作流程，将生成的实现作为语义锚点来指导下游规约和定理陈述的生成。在178个算法问题和五种LLM上的实验表明，相较于直接基线方法，BRIDGE将Lean可执行代码的正确率提升了近1.5倍（在5次尝试中的通过率），且在推理阶段可达到2倍的样本效率提升——在生成长度相近的情况下，每个已验证解决方案所需的样本数更少。我们进一步发现，规约驱动的提示方法可将Python代码通过率提升高达17.5%。除推理阶段提示外，对BRIDGE风格推理轨迹进行监督微调，其Lean通过成功率比纯代码监督微调提升近1.5倍，这表明此类中间表示具有可学习性。BRIDGE为扩展可验证综合提供了实用基础，并为专家迭代和完整证明生成的未来研究提供了动力。