The widespread adoption of open-source ecosystems lets developers integrate third-party packages, but it also exposes them to malicious packages, distributed through public repositories such as PyPI, that are crafted to execute harmful behavior. Existing datasets (e.g., pypi-malregistry, DataDog, OpenSSF, MalwareBench) label packages as malicious or benign at the package level but do not specify which statements implement the malicious behavior. This coarse granularity limits both research and practice: models cannot be trained to localize malicious code, detectors cannot justify alerts with code-level evidence, and analysts cannot systematically study recurring malicious indicators or attack chains. To address this gap, we construct a statement-level dataset of 370 malicious Python packages (833 files, 90,527 lines) with 2,962 labeled occurrences of malicious indicators. From these annotations, we derive a fine-grained taxonomy of 47 malicious indicators across 7 types that captures how adversarial behavior is implemented in code, and we apply sequential pattern mining to uncover recurring indicator sequences that characterize common attack workflows. Our dataset and taxonomy enable explainable, behavior-centric detection and support both semantics-aware model training and practical heuristics for strengthening software supply-chain defenses.
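To make the mining step concrete, the sketch below runs a minimal PrefixSpan-style frequent-subsequence search over toy indicator sequences. The indicator names and sequences are invented for illustration and are not drawn from the paper's taxonomy or dataset, and since the abstract does not name the specific sequential pattern mining algorithm, a PrefixSpan-style search is assumed here.

```python
# Minimal PrefixSpan-style sequential pattern mining sketch.
# All indicator labels and sequences below are hypothetical examples,
# not entries from the paper's taxonomy or dataset.
from collections import Counter

# Each package yields one sequence of indicator labels, ordered by the
# position of the annotated statements in the code (toy data).
sequences = [
    ["collect_host_info", "base64_decode", "http_exfiltrate"],
    ["base64_decode", "exec_dynamic_code", "http_exfiltrate"],
    ["collect_host_info", "base64_decode", "exec_dynamic_code"],
    ["collect_host_info", "http_exfiltrate"],
]

def prefixspan(projected, prefix, min_support, results):
    """Grow `prefix` by items that are frequent in the projected database.

    `projected` holds, for each sequence containing the current prefix,
    the suffix that remains after the prefix's last matched position.
    """
    # Count each item once per suffix in which it occurs.
    counts = Counter()
    for suffix in projected:
        for item in set(suffix):
            counts[item] += 1
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = prefix + [item]
        results.append((pattern, support))
        # Project: keep the part of each suffix after the first match of `item`.
        new_projected = [
            suffix[suffix.index(item) + 1:]
            for suffix in projected
            if item in suffix
        ]
        prefixspan(new_projected, pattern, min_support, results)

results = []
prefixspan(sequences, [], min_support=2, results=results)
for pattern, support in sorted(results, key=lambda r: -r[1]):
    print(support, " -> ".join(pattern))
```

On the toy data, this surfaces recurring workflows such as collect_host_info -> base64_decode -> http_exfiltrate with their support counts, which is the kind of recurring indicator sequence the mining step is meant to expose.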