SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies rely on narrow datasets, inconsistent metrics, and limited checks of semantic consistency. This gap matters more as large language models (LLMs) begin to generate source-like Solidity that compiles and appears plausible even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, with GLM-5 tested both with and without extended reasoning, and we additionally evaluate a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured, compilable Solidity, but semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42 of the 600 contracts. We further show that same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.
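To make the notion of four cumulative stages concrete, the following is a minimal sketch of how per-contract outcomes might be aggregated so that a contract counts toward a stage only if it also passed every earlier stage. It is an illustration under stated assumptions, not SCDBench's actual implementation: the `StageResult` record, its field names, and both helper functions are hypothetical.

```python
from dataclasses import dataclass

# Stage names follow the order given in the abstract; the identifiers are hypothetical.
STAGES = ("format", "compile", "abi", "semantic")


@dataclass
class StageResult:
    """Hypothetical per-contract outcome of the four checks."""
    format_ok: bool       # output is structurally complete
    compiles: bool        # output compiles
    abi_matches: bool     # recovered ABI matches the ground truth
    replay_matches: bool  # differential replay agrees with the original


def stages_passed(result: StageResult) -> int:
    """Count consecutive stages passed, stopping at the first failure."""
    checks = (result.format_ok, result.compiles,
              result.abi_matches, result.replay_matches)
    passed = 0
    for ok in checks:
        if not ok:
            break
        passed += 1
    return passed


def cumulative_pass_counts(results):
    """Count, for each stage, the contracts that passed it and all earlier stages."""
    counts = [0] * len(STAGES)
    for result in results:
        for i in range(stages_passed(result)):
            counts[i] += 1
    return dict(zip(STAGES, counts))
```

Under this assumed scheme, a contract that fails ABI recovery does not count toward semantic consistency even if its replay happens to match, so the counts are non-increasing from one stage to the next.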