Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state of the art indicates that even GPT-4 achieves only 30% precision (when both the decision and the justification are correct). This is likely because off-the-shelf LLMs were pre-trained primarily on general text/code corpora and not fine-tuned for the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. Accordingly, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate the causes of vulnerabilities. However, fine-tuning alone struggles to accurately identify the optimal cause of a vulnerability. We therefore introduce two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable cause of a vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples for fine-tuning. We then compared iAudit with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UniXcoder) as well as prompt-learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit show a consistency of about 38% with the ground-truth causes.