Large language models are increasingly used to generate code from natural language, but ensuring correctness remains challenging. Formal verification offers a principled way to obtain such guarantees by proving that a program satisfies a formal specification. However, specifications are frequently missing in real-world codebases, and writing high-quality specifications remains expensive and expertise-intensive. We present VeriSpecGen, a traceable refinement framework that synthesizes intent-aligned specifications in Lean through requirement-level attribution and localized repair. VeriSpecGen decomposes natural language into atomic requirements and generates requirement-targeted tests with explicit traceability maps to validate generated specifications. When validation fails, traceability maps attribute failures to specific requirements, enabling targeted clause-level repairs. VeriSpecGen achieve 86.6% on VERINA SpecGen task using Claude Opus 4.5, improving over baselines by up to 31.8 points across different model families and scales. Beyond inference-time gains, we generate 343K training examples from VeriSpecGen refinement trajectories and demonstrate that training on these trajectories substantially improves specification synthesis by 62-106% relative and transfers gains to general reasoning abilities.
翻译:大型语言模型越来越多地被用于从自然语言生成代码,但确保正确性仍然具有挑战性。形式化验证通过证明程序满足形式化规约,提供了一种获得此类保证的原则性方法。然而,现实世界代码库中的规约常常缺失,且编写高质量规约成本高昂且需要专业知识。我们提出了 VeriSpecGen,一个可追溯的精化框架,通过需求级归因与局部修复,在 Lean 中合成与意图对齐的规约。VeriSpecGen 将自然语言分解为原子需求,并生成具有显式可追溯性映射的需求导向测试,以验证生成的规约。当验证失败时,可追溯性映射将失败归因于特定需求,从而实现针对性的子句级修复。VeriSpecGen 在使用 Claude Opus 4.5 的 VERINA SpecGen 任务上达到了 86.6% 的准确率,在各模型系列和规模上比基线提升了多达 31.8 个百分点。除了推理阶段性能提升之外,我们还从 VeriSpecGen 的精化轨迹中生成了 34.3 万个训练样本,并证明在这些轨迹上训练可将规约合成能力相对提升 62-106%,并将改进迁移至通用推理能力。