When ChatGPT Meets Smart Contract Vulnerability Detection: How Far Are We?

With the development of blockchain technology, smart contracts have become an important component of blockchain applications. Despite their crucial role, smart contract development can introduce vulnerabilities that lead to severe consequences, such as financial losses. Meanwhile, large language models, represented by ChatGPT, have attracted considerable attention and shown strong capabilities in code analysis tasks. In this paper, we present an empirical study of ChatGPT's performance in identifying smart contract vulnerabilities. First, we evaluate ChatGPT's effectiveness on a publicly available smart contract dataset. We find that while ChatGPT achieves high recall, its precision in pinpointing smart contract vulnerabilities is limited. Furthermore, ChatGPT's performance varies across vulnerability types. We examine the root causes of the false positives generated by ChatGPT and categorize them into four groups. Second, by comparing ChatGPT with other state-of-the-art smart contract vulnerability detection tools, we find that ChatGPT's F-score is lower than that of the other tools for 3 of the 7 vulnerabilities. For the remaining 4 vulnerabilities, ChatGPT has a slight advantage over these tools. Finally, we analyze the limitations of ChatGPT in smart contract vulnerability detection and find that its robustness in this field needs to improve in two respects: its uncertainty in answering questions and the limited length of the code it can analyze. Overall, our research provides insights into the strengths and weaknesses of employing large language models, specifically ChatGPT, for the detection of smart contract vulnerabilities.
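The abstract's findings are stated in terms of recall, precision, and F-score. As a reminder of how these metrics relate, the following is a minimal sketch, assuming the F-score is the standard F1 (the harmonic mean of precision and recall; the abstract does not say which variant is used). The function name and the counts in the usage line are hypothetical and are not results from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: vulnerable contracts correctly flagged
    fp: safe contracts incorrectly flagged (false positives)
    fn: vulnerable contracts that were missed (false negatives)
    """
    # Guard against division by zero when a detector flags nothing
    # or when the dataset contains no vulnerable samples.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the pattern described above:
# a detector that misses few vulnerabilities but raises many false alarms.
p, r, f1 = precision_recall_f1(tp=90, fp=210, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With counts like these, recall is high while precision is low, and the F1 value falls between the two but closer to the lower one. This is why a detector with many false positives can score a lower F-score than tools that flag fewer contracts.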
