Trustworthy Software Project Generation : a Case Study with an Interactive Theorem Prover

Generating code from natural-language requirements has become a primary route for LLM-assisted software development. Although LLMs can successfully complete small programming tasks, generating an entire complex project remains unreliable because subtle errors may survive compilation and testing. Verified programming can reduce this risk by requiring generated implementations to satisfy machine-checked specifications. Existing explorations mostly target verification-oriented languages and toolchains such as Dafny and Frama-C, but directly generating project-scale verified code with these systems remains difficult. This paper studies whether interactive theorem provers (ITPs) can support large-scale software generation. Because ITPs handle pure total functions but not effects such as I/O, our agent separates effectful code from pure logic: it implements the effects in the target language, proves the pure logic in an ITP, and extracts it for integration into the final project. We study this route through the fully automatic development of a CPU interpreter for all 47 instructions of the unprivileged RISC-V RV32I base: after the requirements are supplied, no human intervenes in synthesis, proof repair, extraction, or integration. With Rocq as the backend, the agent completes the project within 30 minutes, producing 1,859 lines of verified Rocq and extracting 2,848 lines of C++. The resulting interpreter passes all 265 LLM-generated tests covering the 47 instructions and exhibits zero crashes and zero hangs during 12 hours of AFL++ fuzzing. Under the same configuration, a Dafny-based backend fails to complete verification. Our observation is that Rocq exposes a concrete proof state when a proof attempt fails, giving the agent actionable feedback for repair. These results provide empirical evidence that ITP-based verified programming is a feasible route for LLM-generated software projects.

翻译：从自然语言需求生成代码已成为大语言模型辅助软件开发的主要途径。尽管大语言模型能成功完成小型编程任务，但生成完整复杂项目仍不可靠，因为细微错误可能在编译和测试中存活。验证式编程可通过要求生成实现满足机器检查的规范来降低此风险。现有探索主要针对面向验证的语言和工具链（如Dafny和Frama-C），但直接利用这些系统生成项目级验证代码仍具挑战。本文研究交互式定理证明器能否支持大规模软件生成。由于交互式定理证明器处理纯全函数而非I/O等效应，我们的智能体将含效应代码与纯逻辑分离：在目标语言中实现效应，在交互式定理证明器中证明纯逻辑，并通过提取将其集成到最终项目。我们通过全自动开发支持非特权RISC-V RV32I基础指令集全部47条指令的CPU解释器来研究此路径：需求提供后，无需人工干预合成、证明修复、提取或集成。以Rocq为后端，智能体在30分钟内完成项目，生成1,859行已验证的Rocq代码并提取2,848行C++代码。生成的解释器通过了涵盖47条指令的全部265个LLM生成测试，并在12小时的AFL++模糊测试中实现零崩溃和零挂起。在相同配置下，基于Dafny的后端未能完成验证。我们的观察是，Rocq在证明失败时暴露具体证明状态，为智能体提供可操作的修复反馈。这些结果为基于交互式定理证明器的验证式编程作为大语言模型生成软件项目的可行路径提供了实证依据。