Software developers typically rely on a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that give researchers easy access to both the metadata and the code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that have been deleted from NPM cannot be scraped. Over 330,000 package versions were deleted from NPM between July 2022 and May 2023, and such deleted data is critical for researchers because it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture that archives the metadata and code of all packages and versions as they are published, and is thus able to retain data that is later deleted. The dataset currently includes over 35 million package versions and grows at a rate of about 1 million versions per month. It is designed to be easily used by researchers answering questions that involve either metadata or program analysis. Both the code and the dataset are available at https://dependencies.science.
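To make the idea of retaining later-deleted data concrete, the following is a minimal sketch of how a crawler might compare two snapshots of one package's registry metadata to find newly published versions (whose code should be archived) and versions that have since disappeared. It is not npm-follower's actual implementation. It assumes the metadata document keeps a `versions` object keyed by version string, with each entry carrying a `dist.tarball` URL; the function name `diff_packument`, the `VersionDiff` type, and the example package and URLs are hypothetical.

```python
from typing import Any, Dict, List, NamedTuple


class VersionDiff(NamedTuple):
    published: List[str]       # versions in the current snapshot but not the previous one
    deleted: List[str]         # versions in the previous snapshot but missing now
    tarballs: Dict[str, str]   # version -> tarball URL for newly published versions


def diff_packument(previous: Dict[str, Any], current: Dict[str, Any]) -> VersionDiff:
    """Compare two metadata snapshots of the same package (hypothetical helper).

    Assumes each snapshot has a "versions" mapping keyed by version string.
    """
    old_versions = previous.get("versions", {})
    new_versions = current.get("versions", {})

    # Plain string sort, used only for deterministic output (not semver order).
    published = sorted(set(new_versions) - set(old_versions))
    deleted = sorted(set(old_versions) - set(new_versions))

    # Collect tarball URLs so the code can be archived before any later deletion.
    tarballs: Dict[str, str] = {}
    for version in published:
        url = new_versions[version].get("dist", {}).get("tarball")
        if url:
            tarballs[version] = url

    return VersionDiff(published, deleted, tarballs)


if __name__ == "__main__":
    # Made-up snapshots for an illustrative package.
    before = {"versions": {
        "1.0.0": {"dist": {"tarball": "https://registry.example/pkg/-/pkg-1.0.0.tgz"}},
        "1.0.1": {"dist": {"tarball": "https://registry.example/pkg/-/pkg-1.0.1.tgz"}},
    }}
    after = {"versions": {
        "1.0.0": {"dist": {"tarball": "https://registry.example/pkg/-/pkg-1.0.0.tgz"}},
        "1.1.0": {"dist": {"tarball": "https://registry.example/pkg/-/pkg-1.1.0.tgz"}},
    }}
    print(diff_packument(before, after))
```

In this sketch a deleted version is only reported, never dropped from the archive; an archive that has already stored the metadata and tarball for that version keeps both, which is the property the abstract describes.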