Deep learning (DL) has been a common thread across several recent techniques for vulnerability detection. The rise of large, publicly available datasets of vulnerabilities has fueled the learning process underpinning these techniques. While these datasets help DL-based vulnerability detectors, they also constrain the detectors' predictive abilities. Vulnerabilities in these datasets have to be represented in a certain way, e.g., as the code lines, functions, or program slices within which the vulnerabilities exist. We refer to this representation as a base unit. The detectors learn how base units can be vulnerable and then predict whether other base units are vulnerable. We hypothesize that this focus on individual base units harms the ability of the detectors to properly detect vulnerabilities that span multiple base units (MBU vulnerabilities). For such vulnerabilities, a correct detection occurs only when all of the comprising base units are detected as vulnerable. Verifying how well existing techniques detect all parts of a vulnerability is important for establishing their effectiveness in downstream tasks. To evaluate our hypothesis, we conducted a study focusing on three prominent DL-based detectors: ReVeal, DeepWukong, and LineVul. Our study shows that the datasets of all three detectors contain MBU vulnerabilities. Further, we observed significant accuracy drops when detecting these types of vulnerabilities. We present our study and a framework that can be used to help DL-based detectors properly include MBU vulnerabilities.
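The difference between scoring individual base units and scoring whole vulnerabilities can be illustrated with a minimal sketch. It is not the framework presented in this work; the function names, the dictionary layout, and the toy identifiers (`V1`, `f1`, and so on) are assumptions made purely for illustration. The only rule taken from the text above is that a vulnerability counts as detected only when all of its base units are flagged.

```python
from typing import Dict, List, Set

# Hypothetical layout: vulnerability id -> ids of the base units it spans.
Vulns = Dict[str, List[str]]


def unit_level_recall(vulns: Vulns, flagged: Set[str]) -> float:
    """Fraction of vulnerable base units that the detector flagged."""
    units = [u for parts in vulns.values() for u in parts]
    return sum(u in flagged for u in units) / len(units)


def vulnerability_level_recall(vulns: Vulns, flagged: Set[str]) -> float:
    """Fraction of vulnerabilities whose base units were ALL flagged."""
    hits = sum(all(u in flagged for u in parts) for parts in vulns.values())
    return hits / len(vulns)


def mbu_only(vulns: Vulns) -> Vulns:
    """Keep only vulnerabilities that span more than one base unit."""
    return {v: parts for v, parts in vulns.items() if len(set(parts)) > 1}


# Toy data (invented): one single-unit and two multi-unit vulnerabilities.
vulns = {"V1": ["f1"], "V2": ["f2", "f3"], "V3": ["f4", "f5", "f6"]}
flagged = {"f1", "f2", "f4", "f5"}  # base units the detector marked vulnerable

unit_recall = unit_level_recall(vulns, flagged)          # 4 of 6 units
vuln_recall = vulnerability_level_recall(vulns, flagged)  # 1 of 3 vulnerabilities
mbu_recall = vulnerability_level_recall(mbu_only(vulns), flagged)  # 0 of 2
```

In this toy case the detector flags most vulnerable base units, yet it fully detects only the single-unit vulnerability and neither of the two MBU vulnerabilities. A per-unit metric alone would hide that gap.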