SolContractEval: A Benchmark for Evaluating Contract-Level Solidity Code Generation

The rise of blockchain has brought smart contracts into mainstream use, creating demand for smart contract generation tools. While large language models (LLMs) excel at generating code in general-purpose languages, their effectiveness on Solidity, the primary language for smart contracts, remains underexplored. Solidity constitutes only a small portion of typical LLM training data and differs from general-purpose languages in its version-sensitive syntax and limited flexibility. These factors raise concerns about the reliability of existing LLMs for Solidity code generation. Critically, existing evaluations focus on isolated functions and synthetic inputs, and so fall short of assessing models' capabilities in real-world contract development. To bridge this gap, we introduce SolContractEval, the first contract-level benchmark for Solidity code generation. It comprises 124 tasks drawn from real on-chain contracts across nine major domains. Each task input consists of complete context dependencies, a structured contract framework, and a concise task prompt, and is independently annotated and cross-validated by experienced developers. To enable precise and automated evaluation of functional correctness, we also develop a dynamic evaluation framework based on historical transaction replay. Building on SolContractEval, we perform a systematic evaluation of six mainstream LLMs and report three findings. First, Claude-3.7-Sonnet achieves the highest overall performance, though all evaluated models perform worse than they do on class-level generation tasks in general-purpose programming languages. Second, current models perform better on tasks that follow standard patterns but struggle with complex logic and inter-contract dependencies. Finally, they exhibit limited understanding of Solidity-specific features and contextual dependencies.
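The abstract does not specify how the replay-based evaluation is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: the same recorded transactions are applied to a reference contract and to a generated candidate, and the candidate passes only if every transaction produces the same outcome and resulting state. Plain Python classes stand in for deployed contracts here, and all names (`Tx`, `Reverted`, `ReferenceToken`, `BuggyToken`, `execute`, `replay_matches`) are hypothetical and not taken from SolContractEval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    """One recorded call (hypothetical stand-in for a historical transaction)."""
    sender: str
    function: str
    args: tuple


class Reverted(Exception):
    """Models a reverted transaction."""


class ReferenceToken:
    """Toy stand-in for the original on-chain contract."""

    def __init__(self):
        self.balances = {"alice": 100}

    def transfer(self, sender, to, amount):
        if self.balances.get(sender, 0) < amount:
            raise Reverted("insufficient balance")
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount
        return True


class BuggyToken(ReferenceToken):
    """Toy stand-in for a generated contract that omits the balance check."""

    def transfer(self, sender, to, amount):
        self.balances[sender] = self.balances.get(sender, 0) - amount
        self.balances[to] = self.balances.get(to, 0) + amount
        return True


def execute(contract, tx):
    """Apply one transaction; return (status, return value, state snapshot)."""
    try:
        result = getattr(contract, tx.function)(tx.sender, *tx.args)
        return ("ok", result, dict(contract.balances))
    except Reverted:
        return ("revert", None, dict(contract.balances))


def replay_matches(reference, candidate, txs):
    """True if the candidate reproduces the reference outcome for every tx."""
    return all(execute(reference, tx) == execute(candidate, tx) for tx in txs)


# Assumed history: the second transfer should revert (bob holds only 30).
history = [
    Tx("alice", "transfer", ("bob", 30)),
    Tx("bob", "transfer", ("carol", 50)),
]

same = replay_matches(ReferenceToken(), ReferenceToken(), history)
buggy = replay_matches(ReferenceToken(), BuggyToken(), history)
```

In this toy run the faithful candidate matches on both transactions, while the buggy one diverges on the second, where the reference reverts and the candidate succeeds. A real framework would presumably execute compiled contracts against EVM state instead of Python objects.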