Beyond Code Similarity: Benchmarking the Plausibility, Efficiency, and Complexity of LLM-Generated Smart Contracts

Smart Contracts are critical components of blockchain ecosystems, with Solidity as the dominant programming language. While LLMs excel at general-purpose code generation, the unique constraints of Smart Contracts, such as gas consumption, security, and determinism, raise open questions about the reliability of LLM-generated Solidity code. Existing studies lack a comprehensive evaluation of these critical functional and non-functional properties. We benchmark four state-of-the-art models under zero-shot and retrieval-augmented generation settings across 500 real-world functions. Our multi-faceted assessment employs code similarity metrics, semantic embeddings, automated test execution, gas profiling, and cognitive and cyclomatic complexity analysis. Results show that while LLMs produce code with high semantic similarity to real contracts, their functional correctness is low: only 20% to 26% of zero-shot generations behave identically to ground-truth implementations under testing. The generated code is consistently simpler, with significantly lower complexity and gas consumption, often due to omitted validation logic. Retrieval-Augmented Generation markedly improves performance, boosting functional correctness by up to 45% and yielding more concise and efficient code. Our findings reveal a significant gap between semantic similarity and functional plausibility in LLM-generated Smart Contracts. We conclude that while RAG is a powerful enhancer, achieving robust, production-ready code generation remains a substantial challenge, necessitating careful expert validation.
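To make the gap between similarity and functional plausibility concrete, the following is a minimal sketch, not the benchmark's actual pipeline. It compares two hypothetical Solidity snippets, one with validation logic and one without, using a token-level Jaccard similarity and a rough decision-point count as a stand-in for cyclomatic complexity. The snippets, the tokenizer, the function names, and the choice to count require and assert as decision points are all assumptions made for illustration.

```python
import re

# Hypothetical ground-truth function that validates its input.
GROUND_TRUTH = """
function withdraw(uint256 amount) external {
    require(amount > 0, "zero amount");
    require(balances[msg.sender] >= amount, "insufficient balance");
    balances[msg.sender] -= amount;
    payable(msg.sender).transfer(amount);
}
"""

# Hypothetical generated function that omits the validation logic.
GENERATED = """
function withdraw(uint256 amount) external {
    balances[msg.sender] -= amount;
    payable(msg.sender).transfer(amount);
}
"""

# Crude tokenizer: identifiers, integers, or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Assumed decision points: branching keywords, guards, and boolean operators.
DECISION_RE = re.compile(r"\b(?:if|for|while|require|assert)\b|&&|\|\||\?")


def tokenize(source: str) -> list[str]:
    """Split source text into a flat list of lexical tokens."""
    return TOKEN_RE.findall(source)


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over the sets of distinct tokens in a and b."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def approx_cyclomatic(source: str) -> int:
    """Rough cyclomatic complexity: one plus the number of decision points."""
    return 1 + len(DECISION_RE.findall(source))


if __name__ == "__main__":
    print("token similarity:", round(jaccard_similarity(GROUND_TRUTH, GENERATED), 3))
    print("complexity (ground truth):", approx_cyclomatic(GROUND_TRUTH))
    print("complexity (generated):", approx_cyclomatic(GENERATED))
```

Under these assumptions the two snippets share most of their tokens, yet the generated one has a lower approximate complexity (1 versus 3) because both require checks are missing. It would also behave differently for a zero amount. This is the kind of divergence that similarity metrics alone would not reveal and that test execution is meant to catch.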
