The growing popularity of modular software development has driven the rise of package managers and language ecosystems. Among them, npm stands out as the largest package manager, hosting more than 2 million third-party open-source packages that greatly simplify the process of building code. However, this openness also brings security risks, as evidenced by numerous package poisoning incidents. In this paper, we maintain a local package cache of more than 3.4 million packages, synchronized in near real time, which gives us access to more detailed package code. We then perform manual inspection and API call sequence analysis on packages collected from public datasets and security reports to build a hierarchical classification framework and a behavior knowledge base covering different sensitive behaviors. In addition, we propose DONAPI, an automatic malicious npm package detector that combines static and dynamic analysis. It makes a preliminary judgment on how malicious a package is using code reconstruction techniques and static analysis, extracts dynamic API call sequences to confirm that judgment and to identify obfuscated content that static analysis cannot handle alone, and finally tags malicious packages based on the constructed behavior knowledge base. To date, we have identified and manually confirmed 325 malicious samples and discovered 2 unusual API calls and 246 API call sequences that have not appeared in known samples.
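To illustrate the final tagging step, in which dynamic API call sequences are compared against a behavior knowledge base, the following is a minimal sketch and not the DONAPI implementation. The knowledge base entries, the behavior labels, and the function names `contains_in_order` and `tag_behaviors` are all hypothetical, and matching by ordered subsequence is an assumption made here for simplicity.

```python
from typing import Dict, List, Sequence

# Hypothetical behavior knowledge base: label -> ordered API call pattern.
# These entries are illustrative assumptions, not taken from the paper.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "sensitive-file-exfiltration": ["fs.readFile", "https.request"],
    "remote-payload-execution": ["https.get", "fs.writeFile", "child_process.exec"],
}


def contains_in_order(trace: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if every call in `pattern` occurs in `trace` in the same order.

    Unrelated calls may appear between the matched ones (subsequence match).
    """
    remaining = iter(trace)
    # `call in remaining` consumes the iterator up to the match, so each
    # later call in the pattern must be found after the earlier ones.
    return all(call in remaining for call in pattern)


def tag_behaviors(trace: Sequence[str]) -> List[str]:
    """Return the sorted labels of all knowledge-base patterns found in the trace."""
    return sorted(
        label
        for label, pattern in KNOWLEDGE_BASE.items()
        if contains_in_order(trace, pattern)
    )


if __name__ == "__main__":
    # A made-up dynamic trace recorded while installing a package.
    observed = ["os.hostname", "fs.readFile", "dns.lookup", "https.request"]
    print(tag_behaviors(observed))
```

With the made-up trace above, only the file exfiltration pattern matches, because the trace never calls `https.get`. A subsequence match tolerates unrelated calls between the sensitive ones, at the cost of more false positives than an exact contiguous match would produce.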