Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.
翻译:大型语言模型能够生成看似合理的代码,但在Lean等证明助手中进行形式验证时仍显脆弱。一个核心的可扩展性挑战在于,验证式合成需要跨多个耦合领域生成一致性的构件:可执行代码、形式规约、定理陈述以及证明尝试。现有方法往往将这些构件分开处理。我们提出BRIDGE,一种面向多构件程序合成的结构化提示框架。BRIDGE将生成过程分解为三个相互关联的领域:代码、规约与定理/证明,并利用领域特定的中间推理将它们连接起来。在Lean中,BRIDGE通常采用代码优先的工作流程,将生成的实现作为后续规约、定理陈述及证明尝试生成的语义锚点。在178个算法问题及五种大语言模型上的实验表明,与直接提示相比,BRIDGE将Lean可执行代码的正确性提升了近1.5倍,且在相近生成长度下样本效率可提升约2倍。我们进一步发现,面向规约的提示可将Python通过率提升高达17.5个百分点。除推理时提示外,基于BRIDGE式推理轨迹的监督微调比仅基于代码的微调带来了近1.5倍的Lean通过率提升,这表明这些中间表示提供了一种可学习的归纳偏置。BRIDGE为规模化验证式合成提供了一个实用框架,同时凸显了可执行代码正确性与完整形式证明生成之间尚存的差距。