Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks

Vulnerability analysis is crucial for software security. This work uses pre-training techniques to improve the understanding of vulnerable code and thereby boost vulnerability analysis. The code understanding ability of a pre-trained model is closely tied to its pre-training objectives. The semantic structure of code, e.g., control and data dependencies, is important for vulnerability analysis. However, existing pre-training objectives either ignore this structure or focus on learning to use it; the feasibility and benefits of learning the knowledge needed to analyze semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet based only on its source code. During pre-training, CDP and DDP guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the model can boost the understanding of vulnerable code during fine-tuning and can be used directly to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis. Experimental results show that PDBERT benefits from CDP and DDP and achieves state-of-the-art performance on the three downstream tasks. PDBERT also achieves F1-scores of over 99% and 94% for predicting control and data dependencies, respectively, in partial and complete functions.
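
To make the two prediction targets concrete, the following minimal sketch shows what statement-level control-dependency labels and token-level data-dependency labels could look like for a toy snippet, and how an F1-score over predicted dependency edges can be computed. It is an illustration only: the abstract does not describe the label encoding, tokenization, or prediction heads used by PDBERT, so the edge direction convention, the hand-written annotations, and the helper names `edges_to_matrix` and `edge_f1` are all assumptions.

```python
from typing import List, Set, Tuple

# Assumed convention: (source index, dependent index).
Edge = Tuple[int, int]


def edges_to_matrix(n: int, edges: Set[Edge]) -> List[List[int]]:
    """Turn a set of directed dependency edges into an n x n 0/1 target matrix."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def edge_f1(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """F1-score of predicted dependency edges against gold edges."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy C-like snippet, annotated by hand for illustration.
statements = [
    "n = read_len();",   # statement 0
    "if (n > 0)",        # statement 1
    "buf = malloc(n);",  # statement 2, runs only when statement 1 holds
]
# Statement-level control dependencies (CDP-style target).
control_edges: Set[Edge] = {(1, 2)}

# Hypothetical tokenization of the same snippet.
tokens = ["n", "=", "read_len", "(", ")", ";",
          "if", "(", "n", ">", "0", ")",
          "buf", "=", "malloc", "(", "n", ")", ";"]
# Token-level data dependencies (DDP-style target):
# the definition of n at token 0 reaches its uses at tokens 8 and 16.
data_edges: Set[Edge] = {(0, 8), (0, 16)}

cdp_target = edges_to_matrix(len(statements), control_edges)
ddp_target = edges_to_matrix(len(tokens), data_edges)

# A hypothetical prediction that finds only one of the two data dependencies.
partial_prediction: Set[Edge] = {(0, 8)}
partial_f1 = edge_f1(partial_prediction, data_edges)
```

In this sketch a model would be trained to reproduce `cdp_target` and `ddp_target` from source text alone; the partial prediction has perfect precision but recovers only half of the gold edges, which is the kind of gap an edge-level F1-score measures.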
