One of the primary factors that encourage developers to contribute to open source software (OSS) projects is the collaborative nature of OSS development. However, the collaborative structure of these communities remains largely unclear, partly due to the enormous scale of the data that must be gathered, processed, and analyzed. In this work, we use the World of Code dataset, which contains commit activity data for millions of OSS projects, to build collaboration networks for ten popular programming language ecosystems, covering in total over 290M commits across over 18M projects. For each language ecosystem, we build a collaboration graph with authors and projects as nodes, which enables various forms of social network analysis at the scale of a language ecosystem. Moreover, we capture each ecosystem's evolution by slicing its network into 30 historical snapshots. Additionally, we calculate multiple collaboration metrics that characterize the ecosystems' states. We make the resulting dataset publicly available, including the constructed graphs and the pipeline, which enables the analysis of further ecosystems.
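To make the graph representation concrete, the following is a minimal sketch of an author-project collaboration graph with time-sliced snapshots. It is not the authors' pipeline: the commit records, the function names (`build_bipartite`, `snapshots`, `collaborators`), the rule that an edge links an author to every project they committed to, and the choice of cumulative, evenly spaced snapshot cutoffs are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical commit records: (author, project, timestamp).
# Real World of Code data is far richer; this is toy input.
commits = [
    ("alice", "proj-a", 100),
    ("bob", "proj-a", 200),
    ("alice", "proj-b", 300),
    ("carol", "proj-b", 400),
]


def build_bipartite(commits, until=None):
    """Build author->projects and project->authors adjacency maps.

    Assumption: an author and a project are linked if the author has
    at least one commit in that project with timestamp <= until.
    """
    author_projects = defaultdict(set)
    project_authors = defaultdict(set)
    for author, project, ts in commits:
        if until is None or ts <= until:
            author_projects[author].add(project)
            project_authors[project].add(author)
    return author_projects, project_authors


def snapshots(commits, n):
    """Return n cumulative snapshots at evenly spaced time cutoffs.

    Assumption: snapshots are cumulative and evenly spaced; the source
    only states that each network is sliced into 30 snapshots.
    """
    times = [ts for _, _, ts in commits]
    start, end = min(times), max(times)
    cutoffs = [start + (end - start) * (i + 1) / n for i in range(n)]
    return [(c, build_bipartite(commits, until=c)) for c in cutoffs]


def collaborators(author, author_projects, project_authors):
    """Authors who share at least one project with the given author."""
    peers = set()
    for project in author_projects.get(author, ()):
        peers |= project_authors[project]
    peers.discard(author)
    return peers


ap, pa = build_bipartite(commits)
print(sorted(collaborators("alice", ap, pa)))
for cutoff, (a_map, p_map) in snapshots(commits, 3):
    print(cutoff, len(a_map), "authors,", len(p_map), "projects")
```

Keeping both adjacency maps makes the author-to-author projection (who collaborates with whom) a simple two-hop lookup, which is the kind of query that collaboration metrics are typically built on.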