The prevalence of malicious packages in open-source repositories such as PyPI poses a critical threat to the software supply chain. While Large Language Models (LLMs) have emerged as a promising tool for automated security tasks, their effectiveness in detecting malicious packages and their indicators remains underexplored. This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages. Using a curated dataset of 4,070 packages (3,700 benign and 370 malicious), we evaluate model performance across two tasks: binary classification (package-level detection) and multi-label classification (identification of specific malicious indicators). We further investigate the impact of prompting strategies, temperature settings, and model specifications on detection accuracy. We find a significant "granularity gap" in LLMs' capabilities: while GPT-4.1 achieves near-perfect performance in binary detection (F1 $\approx$ 0.99), performance degrades by approximately 41\% when the task shifts to identifying specific malicious indicators. We observe that general-purpose models are best suited to filtering out the majority of threats, while specialized coder models are better at detecting attacks that follow a strict, predictable code structure. Our correlation analysis indicates that parameter count and context window size have negligible explanatory power regarding detection accuracy. We conclude that while LLMs are powerful detectors at the package level, they lack the semantic depth required for precise identification at the granular indicator level.