Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust $\texttt{curve25519-dalek}$ verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.
翻译:自动化形式化验证对大型语言模型而言仍具挑战性:这是因为面向证明辅助工具和验证感知语言的数据稀缺,且正确性取决于满足精确的机器可检查规范,而非生成看似合理的代码。本论文研究验证器环境如何通过可验证奖励强化学习(RLVR)和验证器引导的推理时搜索,提升大模型生成已验证程序与证明的能力。首先,我们采用组相对策略优化(GRPO)及变体方法,在Dafny语言中使用RLVR训练开源模型,将生成的候选程序组装为完整程序,并依据编译器与验证器的输出结果对其评分。在基于APPS构建的Dafny数据集上进行的初步实验,将已验证奖励从2.2%提升至58.1%,但暴露出"规范作弊"问题——模型利用薄弱的正式规范而非实现预期解决方案。在过滤掉欠规范和易受攻击的任务后,针对精炼基准的多轮RLVR将验证通过率从9.7%提升至31.1%。其次,我们在Lean证明助手中开发了验证器引导的推理框架,将证明生成转化为对分解子目标、验证器反馈、诊断与修复的结构化搜索。对于固定基础模型,包含证明修正器的完整框架在初始VeriCoding试点集上,将直接修复方法的通过率从46.2%提升至69.2%。在更大的VERINA数据集上,全任务分解结合证明修正器解决了42个先前未解决任务中的7个。我们还引入了Dalek-Bench——一个源自Rust $\texttt{curve25519-dalek}$验证项目的仓库级Lean基准测试;初步结果仍较弱,表明仍需更强的进度评估与任务特定工具使用策略。