Background: Open source software (OSS) is the building block of modern software. However, the prevalence of project deprecation in the open source world weakens the integrity of downstream systems and the broader ecosystem. This calls for efforts to monitor and predict project deprecation so that stakeholders can take proactive measures. Challenge: Existing techniques mainly make predictions from static features captured at a single point in time, which limits their effectiveness. Goal: We propose a novel metric derived from the user-repository network, use it to fit project deprecation predictors, and demonstrate its real-life implications. Method: We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning 2011 to 2023. We propose repository centrality, a family of HITS weights that captures shifts in the popularity of a repository in the repository-user star network. With this metric, we apply recent advances in gradient boosting and deep learning to fit survival analysis models that predict a project's lifespan or its survival hazard. Results: Our study reveals a correlation between the HITS centrality metrics and repository deprecation risk. A drop in the HITS weights of a repository indicates a decline in its centrality and prevalence, and is associated with an increase in its deprecation risk and a decrease in its expected lifespan. Our predictive models, powered by repository centrality and other repository features, achieve satisfactory accuracy on the test set, with repository centrality being the most significant feature among all. Implications: This research offers a novel perspective on how prevalence affects the deprecation of OSS repositories. Our approach to predicting repository deprecation helps detect the health status of projects and take action in advance, fostering a more resilient OSS ecosystem.
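To make the idea of repository centrality concrete, the following is a minimal sketch of standard HITS power iteration on a bipartite user-repository star graph, where users act as hubs and repositories as authorities. The abstract does not specify how the HITS weights are computed or windowed over time, so the function name `hits_scores`, the toy star lists, and the two-snapshot comparison are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt


def hits_scores(stars, iterations=50):
    """Run HITS on a bipartite star graph.

    stars: iterable of (user, repo) pairs, one per star event.
    Returns (hub, authority) dicts with L2-normalized scores.
    Hypothetical helper for illustration only.
    """
    stars = list(stars)
    users = {u for u, _ in stars}
    repos = {r for _, r in stars}
    hub = {u: 1.0 for u in users}
    auth = {r: 1.0 for r in repos}

    for _ in range(iterations):
        # Authority update: a repository sums the hub scores of its stargazers.
        new_auth = dict.fromkeys(repos, 0.0)
        for u, r in stars:
            new_auth[r] += hub[u]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {r: v / norm for r, v in new_auth.items()}

        # Hub update: a user sums the authority scores of the repos they starred.
        new_hub = dict.fromkeys(users, 0.0)
        for u, r in stars:
            new_hub[u] += auth[r]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {u: v / norm for u, v in new_hub.items()}

    return hub, auth


# Toy snapshots (made-up data): stars observed in two consecutive time windows.
window_1 = [("alice", "repo_a"), ("bob", "repo_a"), ("carol", "repo_a"),
            ("carol", "repo_b")]
window_2 = [("alice", "repo_b"), ("bob", "repo_b"), ("carol", "repo_b"),
            ("carol", "repo_a")]

_, auth_1 = hits_scores(window_1)
_, auth_2 = hits_scores(window_2)

# A negative shift means the repository lost centrality between windows.
shift = {repo: auth_2[repo] - auth_1[repo] for repo in auth_1}
```

In this toy setup, `repo_a` attracts most stars in the first window and few in the second, so its authority score falls between snapshots. A time series of such scores per repository is the kind of dynamic feature that could be fed to a survival model, in contrast to a single static star count.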