The open-source software (OSS) ecosystem faces a range of security threats, and malicious packages play a central role in software supply chain (SSC) attacks. Although malware research has a history of over thirty years, OSS malware has received comparatively little attention. Existing research on OSS malware has three limitations: it lacks high-quality datasets, malware diversity, and attack campaign context. In this paper, we first build and curate the largest dataset of malicious packages, 23,425 in total, from scattered online sources. We then propose a knowledge graph to represent the OSS malware corpus and conduct an analysis of malicious packages in the wild. Our main findings are: (1) because there is little data overlap between different online sources, it is essential to collect malicious packages from many of them; (2) despite the sheer volume of SSC attack campaigns, many malicious packages are similar to one another, and unknown or sophisticated attack behaviors have yet to emerge or be detected; (3) OSS malicious packages have a distinct life cycle, denoted {changing->release->detection->removal}, and re-releasing a slightly changed package (under a different name) is a widespread attack tactic; (4) while malicious packages themselves often lack context about how and by whom they were released, security reports disclose information about the corresponding SSC attack campaigns.