VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses

The success of learning-based defensive software vulnerability analyses has been accompanied by a lack of large, high-quality sets of labeled vulnerable program samples, which impedes further advancement of those defenses. Existing automated sample generation approaches have shown potential yet still fall short of practical expectations due to the high noise in the generated samples. This paper proposes VGX, a new technique for large-scale generation of high-quality vulnerability datasets. Given a normal program, VGX identifies the code contexts in which vulnerabilities can be injected, using a customized Transformer that features a new value-flow-based position encoding and is pre-trained against new objectives designed for learning code structure and context. VGX then materializes vulnerability-injection code edits in the identified contexts, using edit patterns obtained from both historical fixes and human knowledge about real-world vulnerabilities. Compared to four state-of-the-art (SOTA) baselines (pattern-, Transformer-, GNN-, and pattern+Transformer-based), VGX achieved 99.09%-890.06% higher F1 and 22.45%-328.47% higher label accuracy. For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair. Our results show that, when those samples were added to their original training data, SOTA techniques for these three application tasks achieved 19.15%-330.80% higher F1, 12.86%-19.31% higher top-10 accuracy, and 85.02%-99.30% higher top-50 accuracy, respectively. These samples also helped a SOTA vulnerability detector discover 13 more real-world vulnerabilities (CVEs) in critical systems (e.g., the Linux kernel) that the original model would have missed.
