Large Language Models (LLMs) have shown great promise in vulnerability identification. Because C/C++ has accounted for half of Open-Source Software (OSS) vulnerabilities over the past decade, and OSS updates arrive mainly through commits, strengthening LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies focus primarily on further pre-training LLMs on massive code datasets, which is resource-intensive and raises efficiency concerns. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge that facilitates communication between C/C++ programs and LLMs. Operating on commits, CLNX efficiently converts source code into a more natural representation while preserving key details: it first applies structure-level naturalization to decompose complex programs, then token-level naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results show that CLNX significantly improves LLMs' performance in identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves a new state of the art and identifies 38 real-world OSS vulnerabilities.
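To make the idea of token-level naturalization concrete, the sketch below replaces terse C/C++ operators with word phrases before the code is fed to a BERT-style model. This is a minimal illustration only: the symbol-to-phrase table and the function name `naturalize_tokens` are hypothetical, not CLNX's actual mapping.

```python
import re

# Hypothetical symbol-to-phrase table for illustration; CLNX's real
# token-level naturalization rules are defined in the paper, not here.
SYMBOL_WORDS = {
    "->": " points to member ",
    "&&": " and ",
    "||": " or ",
    "!=": " is not equal to ",
    "==": " is equal to ",
    "++": " increment ",
}

def naturalize_tokens(code: str) -> str:
    """Rewrite terse C/C++ symbols as word phrases so a BERT-based
    model sees input closer to natural language."""
    # Replace longer symbols first so e.g. "==" is not split apart.
    for sym in sorted(SYMBOL_WORDS, key=len, reverse=True):
        code = code.replace(sym, SYMBOL_WORDS[sym])
    # Collapse the extra whitespace introduced by the replacements.
    return re.sub(r"\s+", " ", code).strip()

print(naturalize_tokens("if (p->next != NULL && cnt == 0) cnt++;"))
# → if (p points to member next is not equal to NULL and cnt is equal to 0) cnt increment ;
```

A dictionary pass like this is intentionally lightweight: it requires no re-training of the underlying model, matching the paper's goal of avoiding resource-intensive further pre-training.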