Existing malicious code detection techniques require integrating multiple tools to detect different malware patterns and often suffer from high misclassification rates. Malicious code detection could therefore be enhanced by adopting advanced, more automated approaches that achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, an automated workflow for malicious code review. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 contain malicious code. We conduct a baseline comparison of the GPT-3 and GPT-4 models against the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious JavaScript code. We also compare the effectiveness of static analysis as a pre-screener for the SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we perform a qualitative study to understand the types of malicious activities detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 score, respectively. GPT-4 achieves higher accuracy, with a 99% precision and a 97% F1 score, while GPT-3 offers a more cost-effective balance at 91% precision and a 94% F1 score. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identifies data theft, connections to suspicious domains, and arbitrary code execution as the top detected malicious activities.