Towards Automated Smart Contract Generation: Evaluation, Benchmarking, and Retrieval-Augmented Repair

Smart contracts, predominantly written in Solidity and deployed on blockchains such as Ethereum, are immutable after deployment, making functional correctness critical. However, existing evaluations of Solidity code generation rely largely on surface-level metrics (e.g., BLEU, CrystalBLEU) or manual inspection, which correlate poorly with functional correctness. Unlike Python, Solidity lacks large-scale, execution-based benchmarks, which limits systematic evaluation of large language models (LLMs) for smart contract development. We introduce SolBench, a comprehensive benchmark and automated testing pipeline for Solidity that measures functional correctness via differential fuzzing. SolBench consists of 28,825 functions extracted from 7,604 real-world smart contracts collected from Etherscan (from genesis through 2024), spanning ten application domains. We benchmark 14 diverse LLMs, covering open and closed models, sizes from 1.3B to 671B parameters, and both general-purpose and code-specialized architectures. The dominant failure mode is missing critical intra-contract information, such as state variables and type definitions. Providing full-contract context improves accuracy but incurs prohibitive inference costs. To address this, we propose Retrieval-Augmented Repair (RAR), a cost-effective framework that integrates execution feedback into code repair. RAR uses compiler and runtime error messages to retrieve only the minimal contract snippets needed to correct a target function, avoiding full-context inference. This significantly reduces input length while improving functional correctness. We further analyze retrieval and repair strategies within RAR, demonstrating consistent gains in accuracy and efficiency. Together, SolBench and RAR enable principled, execution-based evaluation and economical improvement of Solidity code generation. Dataset and code are publicly available at https://github.com/ZaoyuChen/SolBench.
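
As an illustration of the differential fuzzing idea mentioned above, the following is a minimal Python sketch, not the SolBench pipeline itself. It assumes the reference and the generated function can be modeled as plain Python callables over unsigned 256-bit integers, and that a revert can be modeled as a raised exception. The names `differential_fuzz`, `reference_add`, and `wrapping_add` are hypothetical and chosen only for this example.

```python
import random
from typing import Callable, Optional, Tuple

UINT256_MAX = 2**256 - 1  # largest value of Solidity's uint256


def run(fn: Callable[..., int], args: Tuple[int, ...]) -> Tuple[str, object]:
    """Execute fn and normalize the outcome to ('ok', value) or ('revert', error type)."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # a raised exception stands in for a revert
        return ("revert", type(exc).__name__)


def differential_fuzz(
    reference: Callable[..., int],
    candidate: Callable[..., int],
    arity: int,
    trials: int = 1000,
    seed: int = 0,
) -> Optional[Tuple[int, ...]]:
    """Return the first input on which the two functions disagree, or None."""
    rng = random.Random(seed)
    edge_cases = [0, 1, UINT256_MAX]
    for _ in range(trials):
        # Mix boundary values with uniformly random 256-bit integers.
        args = tuple(
            rng.choice(edge_cases) if rng.random() < 0.2 else rng.randint(0, UINT256_MAX)
            for _ in range(arity)
        )
        if run(reference, args) != run(candidate, args):
            return args
    return None


def reference_add(a: int, b: int) -> int:
    """Checked addition: reverts (raises) on overflow."""
    total = a + b
    if total > UINT256_MAX:
        raise OverflowError("uint256 overflow")
    return total


def wrapping_add(a: int, b: int) -> int:
    """Buggy candidate: silently wraps around instead of reverting."""
    return (a + b) % (UINT256_MAX + 1)


if __name__ == "__main__":
    print(differential_fuzz(reference_add, reference_add, arity=2))  # identical behavior
    print(differential_fuzz(reference_add, wrapping_add, arity=2) is not None)
```

In this sketch a candidate counts as functionally correct only if no disagreeing input is found within the trial budget. Unlike a surface-level metric, the comparison depends on observable behavior (return values and reverts) rather than on textual similarity to the reference.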
