Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that large language models (LLMs) have potential in auditing smart contracts, but the state of the art indicates that even GPT-4 achieves only 30% precision (when both the decision and the justification are correct). This is likely because off-the-shelf LLMs were pre-trained primarily on general text/code corpora and not fine-tuned for the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. Accordingly, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate the causes of vulnerabilities. However, fine-tuning alone struggles to accurately identify the optimal cause of a vulnerability. We therefore introduce two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable cause of a vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples for fine-tuning. We then compared iAudit with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UniXcoder) as well as prompt-learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit show a consistency of about 38% with the ground-truth causes.