Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but state-of-the-art results indicate that even GPT-4 achieves only 30% precision (counting a result as correct only when both the decision and the justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose TrustLLM, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, TrustLLM is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, TrustLLM employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone struggles to identify the optimal cause of a vulnerability accurately. Therefore, we introduce two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. We collected a balanced dataset of 1,734 positive and 1,810 negative samples to fine-tune TrustLLM, and then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UniXcoder) as well as prompt-learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, TrustLLM achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by TrustLLM achieve a consistency of about 38% with the ground-truth causes.
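The control flow described above (Detector decides, Reasoner proposes causes, Ranker and Critic iterate over them) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify interfaces, the number of candidate causes, the number of debate rounds, or the stopping rule, so the names `audit`, `AuditResult`, `num_candidates`, and `max_rounds` are hypothetical, and the four `toy_*` functions are trivial string-matching stand-ins for what would be fine-tuned models and LLM agents in the actual system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class AuditResult:
    vulnerable: bool       # Detector decision
    cause: Optional[str]   # selected cause, or None if judged safe
    rounds: int            # Ranker/Critic rounds that were run


def audit(
    code: str,
    detector: Callable[[str], bool],
    reasoner: Callable[[str, int], str],
    ranker: Callable[[str, List[str], List[str]], str],
    critic: Callable[[str, str], Tuple[bool, str]],
    num_candidates: int = 3,   # hypothetical default
    max_rounds: int = 3,       # hypothetical default
) -> AuditResult:
    """Hypothetical two-stage audit loop; not the paper's implementation."""
    # Stage 1: the Detector makes the vulnerable / safe decision.
    if not detector(code):
        return AuditResult(False, None, 0)

    # Stage 2: the Reasoner proposes several candidate causes.
    candidates = [reasoner(code, i) for i in range(num_candidates)]

    # Agents: the Ranker picks a cause, the Critic accepts or rejects it.
    rejected: List[str] = []
    best = candidates[0]
    for round_no in range(1, max_rounds + 1):
        best = ranker(code, candidates, rejected)
        accepted, _comment = critic(code, best)
        if accepted:
            return AuditResult(True, best, round_no)
        rejected.append(best)
    # No agreement within the budget: fall back to the last ranked cause.
    return AuditResult(True, best, max_rounds)


# --- Toy stand-ins (assumptions, purely for demonstration) ---------------
def toy_detector(code: str) -> bool:
    return "call.value" in code


def toy_reasoner(code: str, i: int) -> str:
    causes = [
        "missing access control",
        "reentrancy via external call before state update",
        "integer overflow",
    ]
    return causes[i % len(causes)]


def toy_ranker(code: str, candidates: List[str], rejected: List[str]) -> str:
    remaining = [c for c in candidates if c not in rejected]
    return (remaining or candidates)[0]


def toy_critic(code: str, cause: str) -> Tuple[bool, str]:
    ok = "reentrancy" in cause
    return ok, ("" if ok else "cause does not explain the external call")


snippet = "msg.sender.call.value(amount)(); balances[msg.sender] = 0;"
result = audit(snippet, toy_detector, toy_reasoner, toy_ranker, toy_critic)
safe = audit("function f() public {}", toy_detector, toy_reasoner,
             toy_ranker, toy_critic)
```

With these toy stand-ins, the Critic rejects the first ranked cause and accepts the second, so the loop ends after two rounds; the snippet without an external call is returned as safe before the Reasoner is ever invoked. Passing the four components as plain callables is only a convenience for the sketch, since it lets each one be swapped for a real model call without changing the loop.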