Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications

Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but state-of-the-art results indicate that even GPT-4 achieves only 30% precision (counting a prediction as correct only when both the decision and the justification are correct). This is likely because off-the-shelf LLMs are pre-trained primarily on general text and code corpora and are not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose TrustLLM, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, TrustLLM is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, TrustLLM employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone struggles to accurately identify the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable cause of the vulnerability based on the output of the fine-tuned Reasoner model. To evaluate TrustLLM, we collected a balanced dataset of 1,734 positive and 1,810 negative samples for fine-tuning. We then compared TrustLLM with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, TrustLLM achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by TrustLLM achieve a consistency of about 38% with the ground-truth causes.
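
The control flow described above (Detector decides, Reasoner proposes causes, Ranker and Critic iterate on the best cause) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TrustLLM's actual implementation: the function `audit`, the `AuditResult` type, the parameters `num_candidates` and `max_rounds`, the boolean accept/reject Critic, and all `toy_*` stand-ins are hypothetical names introduced here, and the real models and agent prompts are not specified in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AuditResult:
    vulnerable: bool
    cause: Optional[str] = None


def audit(
    code: str,
    detector: Callable[[str], bool],            # stage 1: fine-tuned decision model
    reasoner: Callable[[str, int], List[str]],  # stage 2: fine-tuned cause generator
    ranker: Callable[[str, List[str]], str],    # agent: picks the best candidate
    critic: Callable[[str, str], bool],         # agent: accepts or rejects the pick
    num_candidates: int = 3,                    # assumed value
    max_rounds: int = 3,                        # assumed value
) -> AuditResult:
    # Stage 1: decide whether the code looks vulnerable at all.
    if not detector(code):
        return AuditResult(vulnerable=False)

    # Stage 2: generate several candidate causes.
    remaining = list(reasoner(code, num_candidates))
    if not remaining:
        return AuditResult(vulnerable=True)

    # Agent loop: the Ranker selects a cause and the Critic challenges it.
    best = ranker(code, remaining)
    for _ in range(max_rounds):
        if critic(code, best):
            break  # Critic accepts the selected cause
        # Assumed policy: drop the rejected cause and re-rank the rest.
        remaining = [c for c in remaining if c != best]
        if not remaining:
            break  # nothing left; fall back to the last selected cause
        best = ranker(code, remaining)
    return AuditResult(vulnerable=True, cause=best)


# Toy stand-ins so the sketch runs without any model.
def toy_detector(code: str) -> bool:
    return "call.value" in code


def toy_reasoner(code: str, n: int) -> List[str]:
    causes = [
        "integer overflow in balance update",
        "reentrancy: external call before state update",
        "missing access control",
    ]
    return causes[:n]


def toy_ranker(code: str, candidates: List[str]) -> str:
    return candidates[0]


def toy_critic(code: str, cause: str) -> bool:
    return cause.startswith("reentrancy")


snippet = "msg.sender.call.value(amount)(); balances[msg.sender] = 0;"
result = audit(snippet, toy_detector, toy_reasoner, toy_ranker, toy_critic)
print(result)
```

Passing the four components as plain callables keeps the sketch independent of any particular model or API; in practice each would wrap a fine-tuned model or a prompted LLM.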
