Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVD) suffer from a shortage of data, which limits their effectiveness. Data augmentation can potentially alleviate this shortage, but augmenting vulnerable code is challenging and requires a generative solution that preserves the vulnerability. Previous works have focused only on generating samples that contain single-statement or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval-augmented generation (RAG). Therefore, we propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies for augmenting both single- and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across three vulnerability datasets and DLVD models, using two LLMs, shows that our approach beats two SOTA methods, Vulgen and VGX, as well as Random Oversampling (ROS), by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average, and by 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples at a cost as low as US$1.88.