Code-centric Learning-based Just-In-Time Vulnerability Detection

Attacks that exploit software vulnerabilities can cause substantial damage to the cyber-infrastructure of modern society and the economy. To minimize the consequences, it is vital to detect and fix vulnerabilities as early as possible. Just-in-time vulnerability detection (JIT-VD) identifies vulnerability-prone ("dangerous") commits so that they can be stopped before they are merged into the source code and introduce vulnerabilities. With JIT-VD, commit authors, who understand their own changes best, can review dangerous commits and fix them if necessary while the relevant modifications are still fresh in their minds. In this paper, we propose CodeJIT, a novel code-centric, learning-based approach to just-in-time vulnerability detection. The key idea of CodeJIT is that the meaning of a commit's code changes is the direct and deciding factor in determining whether the commit is dangerous. Based on this idea, we design a novel graph-based representation that captures the semantics of code changes in terms of both code structure and program dependencies. A graph neural network model is then developed to capture the meaning of the code changes expressed in this representation and to learn to discriminate between dangerous and safe commits. We conducted experiments to evaluate the JIT-VD performance of CodeJIT on a dataset of 20K+ dangerous and safe commits from 506 real-world projects spanning 1998 to 2022. Our results show that CodeJIT significantly outperforms the state-of-the-art JIT-VD methods, by up to 66% in Recall, 136% in Precision, and 68% in F1. Moreover, CodeJIT correctly classifies nearly 9 out of 10 dangerous and safe (benign) commits, and even detects 69 commits that fix a vulnerability yet introduce other issues into the source code.
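The abstract does not spell out how the graph-based representation is built, so the following is only a rough sketch of the general idea, not CodeJIT's actual construction. It assumes a commit can be viewed as two edge sets (before and after the change), where each edge carries a relation of either "structure" or "dependency", and it merges them into one graph whose edges are tagged as unchanged, added, or deleted. The function name `build_change_graph`, the edge encoding, and the toy commit are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical edge encoding: (source node, target node, relation), where
# relation is "structure" (syntax-tree parent/child) or "dependency"
# (data or control dependence). This encoding is an assumption for the sketch.
Edge = Tuple[str, str, str]


def build_change_graph(old_edges: Set[Edge], new_edges: Set[Edge]) -> Dict[Edge, str]:
    """Merge pre- and post-commit graphs, tagging each edge with its change status."""
    graph: Dict[Edge, str] = {}
    for edge in old_edges | new_edges:
        if edge in old_edges and edge in new_edges:
            graph[edge] = "unchanged"
        elif edge in new_edges:
            graph[edge] = "added"
        else:
            graph[edge] = "deleted"
    return graph


# Toy commit (invented for illustration): a bounds check guarding a copy is removed.
old_version: Set[Edge] = {
    ("func", "if len < cap", "structure"),
    ("if len < cap", "copy(buf, src, len)", "structure"),
    ("len = read()", "if len < cap", "dependency"),
    ("len = read()", "copy(buf, src, len)", "dependency"),
}
new_version: Set[Edge] = {
    ("func", "copy(buf, src, len)", "structure"),
    ("len = read()", "copy(buf, src, len)", "dependency"),
}

change_graph = build_change_graph(old_version, new_version)
for edge, status in sorted(change_graph.items()):
    print(status, edge)
```

In a sketch like this, the change tags on nodes and edges are what a graph neural network would consume as features; here the deleted edges around the bounds check are the kind of signal that could mark the commit as dangerous.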
