Smart contracts underpin high-value ecosystems such as decentralized finance (DeFi), yet recurring vulnerabilities continue to cause losses worth billions of dollars. Although numerous security analyzers exist to detect such flaws, real-world attacks remain frequent, raising the question of whether these tools are truly effective or simply under-used due to low developer trust. Prior benchmarks have evaluated analyzers on synthetic or vulnerable-only contract datasets, limiting their ability to measure false positives, false negatives, and the usability factors that drive adoption. To close this gap, we present a mixed-methods study that combines large-scale benchmarking with practitioner insights. We evaluate six widely used analyzers (Confuzzius, Dlva, Mythril, Osiris, Oyente, and Slither) on 653 real-world smart contracts covering three high-impact vulnerability classes from the OWASP Smart Contract Top Ten: reentrancy, suicidal contract termination, and integer arithmetic errors. Our results show substantial variation in accuracy (F1 scores ranging from 31.2% to 94.6%), high false-positive rates (up to 32.6%), and per-contract runtimes exceeding 700 seconds. We then survey 150 professional developers and auditors to understand how they use and perceive these tools. Our findings reveal that excessive false positives, vague explanations, and long analysis times are the main barriers to trust and adoption in practice. By linking measurable performance gaps to developer perceptions, we provide concrete recommendations for improving the precision, explainability, and usability of smart-contract security analyzers.