npm-follower: A Complete Dataset Tracking the NPM Ecosystem

Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that give researchers easy access to both the metadata and the code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that have been deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This deleted data is critical for researchers, as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture that archives the metadata and code of all packages and versions as they are published, and is thus able to retain data that is later deleted. The dataset currently includes over 35 million versions of packages and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at https://dependencies.science.
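
To illustrate why archiving at publication time makes deletions recoverable, the following is a minimal sketch, not the npm-follower implementation. It assumes each package's metadata is stored as a dictionary with a "versions" mapping keyed by version string, as in NPM registry package documents; the function names and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Optional

# Assumed shape of a package metadata document: {"name": ..., "versions": {...}}.
Packument = Dict[str, Any]


def version_set(packument: Packument) -> set:
    """Return the set of version strings listed in a metadata document."""
    return set(packument.get("versions", {}))


def deleted_versions(archived: Packument, current: Optional[Packument]) -> List[str]:
    """Versions present in the archived snapshot but missing upstream.

    `current` is None when the whole package is gone from the registry.
    The result is sorted as plain strings, not by semantic version order.
    """
    if current is None:
        return sorted(version_set(archived))
    return sorted(version_set(archived) - version_set(current))


def new_versions(archived: Packument, current: Packument) -> List[str]:
    """Versions published upstream that the archive has not stored yet."""
    return sorted(version_set(current) - version_set(archived))


# Hypothetical snapshots of one package, taken at two points in time.
archived = {"name": "demo", "versions": {"1.0.0": {}, "1.0.1": {}}}
current = {"name": "demo", "versions": {"1.0.0": {}, "1.1.0": {}}}

print(deleted_versions(archived, current))
print(new_versions(archived, current))
```

Because the archive keeps the earlier snapshot, a version that disappears upstream can still be identified and its stored metadata and code retained, whereas a one-time scrape taken after the deletion would never see it.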

