Large language models have become proficient at generating functional code, but ensuring the output truly matches the programmer's intent remains difficult. Testing improves trust, yet for safety-critical applications, formal verification provides the only true guarantees through machine-checked proofs. However, verified code remains scarce compared to code in mainstream languages or formalized mathematics, limiting LLM capabilities in this domain. We present ATLAS, an automated pipeline that synthesizes verified programs to address this data bottleneck. Applied to the TACO dataset of Python solutions to LeetCode-style problems, ATLAS generates 2.7K verified Dafny programs, each with a high-quality specification and a machine-checked proof. Through task decomposition, we extract 19K training examples. Fine-tuning Qwen 2.5 7B Coder on this data improves performance from 32.4% to 56.9% on DafnyBench and from 15.8% to 65.8% on DafnySynthesis, demonstrating that synthetic data generation is a viable path to scaling LLM capabilities for formal verification.