Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
翻译:大型语言模型(LLM)在理解和开发代码方面展现出令人瞩目的能力。然而,它们在严格推理和证明代码正确性方面的能力仍存疑问。本文对 LLM 为 Rust 编写的系统软件开发正确性证明的能力进行了全面研究。我们整理了一个新的系统验证基准测试套件 VeruSAGE-Bench,其中包含从八个开源的、经过 Verus 验证的 Rust 系统中提取的 849 个证明任务。此外,我们设计了不同的智能体系统,以匹配不同 LLM(o4-mini、GPT-5、Sonnet 4 和 Sonnet 4.5)的优势和劣势。我们的研究表明,需要不同的工具和智能体设置来激发不同类型 LLM 的系统验证能力。在我们的研究中,最佳的 LLM-智能体组合完成了 VeruSAGE-Bench 中超过 80% 的系统验证任务。它还完成了一组不属于 VeruSAGE-Bench(因为尚未由人类专家完成)的系统证明任务中的超过 90%。这一结果表明了 LLM 辅助开发已验证系统软件的巨大潜力。