Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, verifiable computation, and secure finance. However, authoring ZK programs remains challenging: unlike conventional software development, ZK programming entails a fundamental paradigm shift from \textit{imperative computation} to \textit{declarative verification}. It demands rigorous reasoning about finite field arithmetic and complex constraint systems, skills rarely exercised in mainstream imperative languages, making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and constraint-level reasoning, remains unexplored. To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities in ZK programming at three levels: language knowledge, algebraic primitive competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that although models exhibit strong command of language syntax, they struggle to implement and compose algebraic primitives into correct constraint systems, frequently producing incorrect programs. Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments with GPT-o3 on Circom and Noir show substantial gains, with success rates improving from 20.29\% to 87.85\% and from 28.38\% to 97.79\%, respectively. With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a new basis for systematically measuring and augmenting LLMs in ZK code generation, lowering the barrier for practitioners and advancing privacy-preserving computation.
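The imperative-to-declarative shift described above can be sketched in a few lines. The following is an illustrative Python sketch, not code from the paper: the function names are hypothetical, and the prime is the BN254 scalar field modulus that Circom uses by default. It contrasts directly computing a value with merely \textit{checking} a rank-1-style constraint over a prover-supplied witness, which is the kind of reasoning ZK languages require.

```python
# BN254 scalar field prime (the default field in Circom circuits).
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

def imperative_square(x: int) -> int:
    # Conventional programming: compute the output directly.
    return (x * x) % P

def declarative_square_constraint(x: int, y: int) -> bool:
    # ZK-style programming: assert the relation x * x - y == 0 (mod P).
    # The constraint system does not compute y; it only verifies that
    # the witness (x, y) supplied by the prover satisfies the relation.
    return (x * x - y) % P == 0

# A valid witness passes the constraint; a tampered one does not.
x = 5
y = imperative_square(x)
assert declarative_square_constraint(x, y)
assert not declarative_square_constraint(x, y + 1)
```

The subtlety the sketch hints at is that the declarative form must constrain \textit{every} admissible witness: an under-constrained system accepts invalid witnesses even though the imperative computation looks correct, which is precisely the class of bug the paper finds LLMs prone to.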