Most existing pre-trained language models for source code focus on learning from static code text, typically augmented with static code structures (abstract syntax trees, dependency graphs, etc.). However, program semantics are not fully exposed until the program is actually executed. Without an understanding of program execution, statically pre-trained models fail to comprehensively capture dynamic code properties, such as branch coverage and runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models on a combination of source code, executable inputs, and the corresponding execution traces. Our goal is to teach code models complicated execution logic during pre-training, enabling them to statically estimate dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED achieves relative improvements over statically pre-trained code models of 12.4% on complete execution path prediction and 25.2% on runtime variable value prediction. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.