The prevalence of malicious packages in open-source repositories such as PyPI poses a critical threat to the software supply chain. While Large Language Models (LLMs) have emerged as a promising tool for automated security tasks, their effectiveness in detecting malicious packages and their specific indicators remains underexplored. This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages. Using a curated dataset of 4,070 packages (3,700 benign and 370 malicious), we evaluate model performance on two tasks: binary classification (package-level detection) and multi-label classification (identification of specific malicious indicators). We further investigate the impact of prompting strategies, temperature settings, and model specifications (parameter count and context window size) on detection accuracy. We find a significant "granularity gap" in LLMs' capabilities: while GPT-4.1 achieves near-perfect performance in binary detection (F1 $\approx$ 0.99), performance degrades by approximately 41\% when the task shifts to identifying specific malicious indicators. We observe that general-purpose models are most effective at filtering out the majority of threats, whereas specialized coder models are better at detecting attacks that follow a strict, predictable code structure. Our correlation analysis indicates that parameter count and context window size have negligible explanatory power with respect to detection accuracy. We conclude that while LLMs are powerful detectors at the package level, they lack the semantic depth required for precise identification at the indicator level.
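To make the two evaluation tasks concrete, the sketch below shows one plausible way to score them; it is not the paper's actual pipeline. The indicator names and toy labels are hypothetical, and macro-averaged F1 for the multi-label task is an assumption, since the abstract does not state which averaging scheme is used.

```python
# A minimal, self-contained sketch (not from the paper) of how the two
# evaluation tasks could be scored. Indicator names and labels are toy data.
import numpy as np
from sklearn.metrics import f1_score

# Task 1: binary classification -- is each package malicious (1) or benign (0)?
y_true_bin = np.array([0, 0, 1, 1, 0, 1])  # ground-truth package labels
y_pred_bin = np.array([0, 0, 1, 1, 0, 0])  # LLM verdicts
print(f"binary F1: {f1_score(y_true_bin, y_pred_bin):.2f}")  # 0.80

# Task 2: multi-label classification -- which indicators does each
# malicious package exhibit? One column per indicator (illustrative names).
indicators = ["obfuscated_code", "data_exfiltration", "install_hook", "remote_exec"]
y_true_ml = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 1, 0, 0]])
y_pred_ml = np.array([[1, 0, 0, 0],
                      [0, 0, 0, 1],
                      [1, 0, 1, 0]])

# Per-indicator F1, then the macro average: every indicator counts equally,
# so systematically missed indicators drag the score down.
per_label = f1_score(y_true_ml, y_pred_ml, average=None, zero_division=0)
for name, score in zip(indicators, per_label):
    print(f"  {name}: F1 = {score:.2f}")
print(f"multi-label macro F1: {per_label.mean():.2f}")  # 0.50
```

Under macro averaging, a model that catches every malicious package but misses individual behaviors can score far lower on the second task, which is consistent with the reported granularity gap.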