Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritize vulnerabilities reported by program analysis tools. We focus on Node.js packages, collecting a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNNs and LLMs, all trained on data derived from a dynamic program analysis tool's output. The top LLM achieves $F_{1} {=} 0.915$, while the best GNN and classical ML models reach $F_{1} {=} 0.904$. At a false-negative rate below 7%, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. When the best model is tuned to operate at a precision of 0.8 (i.e., allowing 20% false positives among all warnings), our approach detects 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
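The operating point described above (tuning the classifier to a target precision and reporting the recall achieved there) can be sketched as a simple threshold search over model scores. This is a minimal illustration, not the paper's implementation; the helper name and the toy data are hypothetical.

```python
import numpy as np

def threshold_for_precision(scores, labels, target_precision=0.8):
    """Hypothetical helper: find the lowest score threshold whose
    precision meets the target, maximizing recall at that level.

    scores: model confidence that a report is a true vulnerability
    labels: 1 = exploitable, 0 = benign
    """
    order = np.argsort(-np.asarray(scores))
    s = np.asarray(scores)[order]
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)            # true positives if we flag the top-k reports
    fp = np.cumsum(1 - y)        # false positives among the top-k
    precision = tp / (tp + fp)
    ok = np.flatnonzero(precision >= target_precision)
    if ok.size == 0:
        return None, 0.0         # target precision unattainable
    k = ok.max()                 # largest cutoff still meeting the target
    recall = tp[k] / y.sum()
    return s[k], recall

# Toy example: scores and ground-truth labels for five reports
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
labels = np.array([1, 1, 0, 1, 0])
thr, rec = threshold_for_precision(scores, labels, target_precision=0.75)
```

In practice the threshold would be chosen on a validation split and then applied to incoming analysis reports, trading a bounded false-positive budget for near-complete coverage of exploitable flows.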