Ensuring the correctness of smart contracts is critical, as even subtle flaws can lead to severe financial losses. While bug detection tools that spot common vulnerability patterns can serve as a first line of defense, most real-world exploits and losses stem from errors in the contract's business logic. Formal verification tools such as SolCMC and the Certora Prover address this challenge, but their impact remains limited by steep learning curves and restricted specification languages. Recent work has begun to explore the use of large language models (LLMs) for security-related tasks such as vulnerability detection and test generation. Yet a fundamental question remains open: can LLMs aid in assessing the validity of arbitrary contract-specific properties? In this paper, we provide the first systematic empirical evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study combines quantitative metrics with qualitative analysis, and shows that recent reasoning-oriented LLMs, although lacking soundness guarantees, can be surprisingly effective at predicting the (in)validity of complex properties, suggesting a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.


