Evaluation of ChatGPT's Smart Contract Auditing Capabilities Based on Chain of Thought

Smart contracts are a key component of blockchain technology: they automate transactions and enforce protocol rules. However, they are susceptible to security vulnerabilities that, if exploited, can lead to significant asset losses. This study explores the potential of the GPT-4 model to enhance smart contract security audits. To evaluate GPT-4's ability to identify seven common types of vulnerabilities, we used a dataset of 35 smart contracts from the SolidiFI-benchmark vulnerability library, containing 732 vulnerabilities, and compared GPT-4 with five other vulnerability detection tools. We also assessed GPT-4's performance in code parsing and vulnerability capture by simulating a professional auditor's process with Chain of Thought (CoT) prompts based on the audit reports of eight groups of smart contracts, and we evaluated its ability to write Solidity Proofs of Concept (PoCs). In our experiments, GPT-4 performed poorly at detecting smart contract vulnerabilities: its Precision was high (96.6%), but its Recall was only 37.8% and its F1-score 41.1%, indicating a tendency to miss vulnerabilities. In contrast, it parsed contract code well, with an average comprehensive score of 6.5, and was able to identify the background information and functional relationships of smart contracts. In 60% of cases it wrote usable PoCs. These results indicate that GPT-4 cannot yet detect smart contract vulnerabilities effectively, but that its performance in code parsing and PoC writing gives it significant potential as an auxiliary tool for improving the efficiency and effectiveness of smart contract security audits.
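
As a reading aid for the detection metrics quoted above, the following minimal Python sketch shows how Precision, Recall, and F1 are conventionally computed from true-positive, false-positive, and false-negative counts. The function names and all counts are hypothetical and are not taken from the study. The abstract does not say how the reported F1-score was aggregated; the harmonic mean of the two overall rates (96.6% and 37.8%) is about 54.3%, so the reported 41.1% was presumably averaged per vulnerability type or per contract, which is an assumption on our part.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from_rates(precision, recall):
    """Harmonic mean of an already-computed precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(per_category_counts):
    """Unweighted mean of per-category F1 scores.

    per_category_counts: list of (tp, fp, fn) tuples, one per category.
    """
    scores = [precision_recall_f1(tp, fp, fn)[2]
              for tp, fp, fn in per_category_counts]
    return sum(scores) / len(scores)


def micro_f1(per_category_counts):
    """F1 computed after pooling the counts of all categories."""
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    return precision_recall_f1(tp, fp, fn)[2]


if __name__ == "__main__":
    # Harmonic mean of the overall rates quoted in the abstract.
    print(round(f1_from_rates(0.966, 0.378), 3))

    # Hypothetical counts: one category detected fully, one mostly missed.
    counts = [(10, 0, 0), (1, 0, 9)]
    print(round(macro_f1(counts), 3), round(micro_f1(counts), 3))
```

With the hypothetical counts above, the macro-averaged and pooled F1 values differ noticeably, which illustrates how an averaged F1 can fall below the harmonic mean of the overall Precision and Recall.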
