The rise of malicious packages in public registries poses a significant threat to software supply chain (SSC) security. Although academia and industry use methods such as software composition analysis (SCA) to address this issue, existing approaches often lack timely and comprehensive intelligence updates. This paper introduces PackageIntel, a novel platform for collecting, processing, and retrieving malicious package intelligence. By combining exhaustive search, snowball sampling from diverse sources, and large language models (LLMs) with specialized prompts, PackageIntel improves coverage, timeliness, and accuracy. We have built a database of 20,692 malicious NPM and PyPI packages sourced from 21 distinct intelligence repositories. Empirical evaluation shows that PackageIntel achieves a precision of 98.6% and an F1 score of 92.0 in intelligence extraction. It also detects threats on average 70% earlier than leading databases such as Snyk and OSV, at a cost of $0.094 per intelligence piece. The platform has identified and reported over 1,000 malicious packages in downstream package manager mirror registries. This research provides a robust, efficient, and timely solution for identifying and mitigating threats within the software supply chain ecosystem.