In this paper, we test the hypothesis that although OpenAI's GPT-4 performs well generally, open-source models can be fine-tuned to outperform it in smart contract vulnerability detection. Using Meta's Code Llama and a dataset of 17k prompts, we fine-tune two models, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, against GPT-4 and GPT-4 Turbo on a test set we develop, measuring detection of the eight vulnerabilities from the dataset as well as the two most frequently identified vulnerabilities, and reporting weighted F1 scores. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of $0.776$ and $0.68$, outperforming both GPT-4 and GPT-4 Turbo ($0.66$ and $0.675$). For individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperform GPT-4 and GPT-4 Turbo, in weighted F1 across all vulnerabilities ($0.61$ and $0.56$, respectively, against GPT-4's $0.218$ and GPT-4 Turbo's $0.243$) and in weighted F1 for the two most frequently identified vulnerabilities ($0.719$ for GPT-3.5FT and $0.674$ for Detect Llama - Foundation, against GPT-4's $0.363$ and GPT-4 Turbo's $0.429$).