Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries

Security experts reverse engineer (decompile) binary code to identify critical security vulnerabilities. Limited access to source code in vital systems, such as firmware, drivers, and proprietary software used in Critical Infrastructures (CI), makes binary-level analysis even more crucial. Even when source code is available, a semantic gap persists after compilation between the source and the binary code executed by the processor, and this gap may hinder the detection of vulnerabilities in source code. Nevertheless, current research on Large Language Models (LLMs) overlooks the significance of decompiled binaries in this area by focusing solely on source code. In this work, we are the first to empirically uncover the substantial semantic limitations of state-of-the-art LLMs in analyzing vulnerabilities in decompiled binaries, largely due to the absence of relevant datasets. To bridge this gap, we introduce DeBinVul, a novel decompiled binary code vulnerability dataset. Our dataset is multi-architecture and multi-optimization, focusing on C/C++ because of their wide use in CI and their association with numerous vulnerabilities. Specifically, we curate 150,872 samples of vulnerable and non-vulnerable decompiled binary code for four tasks in the domain of decompiled binaries: (i) identifying, (ii) classifying, and (iii) describing vulnerabilities, and (iv) recovering function names. Subsequently, we fine-tune state-of-the-art LLMs using DeBinVul and report performance increases of 19%, 24%, and 21% for CodeLlama, Llama3, and CodeGen2, respectively, in detecting binary code vulnerabilities. Additionally, using DeBinVul, we report a high performance of 80-90% on the vulnerability classification task. Furthermore, we report improved performance on the function name recovery and vulnerability description tasks.
