BRIDGE: Building Representations In Domain Guided Program Synthesis

Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.

翻译：大型语言模型可以生成看似合理的代码，但在Lean等证明助手中进行形式验证时仍显脆弱。一个核心的可扩展性挑战在于，经过验证的合成需要跨多个耦合领域保持一致的制品：可执行代码、形式规约、定理陈述以及证明尝试。现有方法通常将这些制品分开处理。我们提出BRIDGE，一个用于多制品程序合成的结构化提示框架。BRIDGE将生成过程分解为三个相互关联的领域：代码、规约以及定理/证明，并使用领域特定的中间推理将它们连接起来。在Lean中，BRIDGE通常遵循代码优先的工作流程，将生成的实现作为下游规约、定理陈述和证明尝试生成的语义锚点。在178个算法问题和五个大型语言模型上，相较于直接提示，BRIDGE将Lean可执行代码的正确性提升了近1.5倍，并且在相似生成长度下，样本效率可提升约2倍。我们进一步发现，面向规约的提示使Python通过率最多提升17.5个百分点。除了推理时的提示之外，在BRIDGE风格的推理轨迹上进行监督微调，相比仅基于代码的微调，Lean测试通过率提高了近1.5倍，表明这些中间表示提供了一种可学习的归纳偏置。BRIDGE为扩展验证合成提供了实用框架，同时凸显了可执行代码正确性与完整形式证明生成之间尚存的差距。