Forecasting the Maintained Score from the OpenSSF Scorecard for GitHub Repositories Linked to PyPI Libraries

The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories. Its Maintained metric indicates recent development activity and helps identify potentially abandoned dependencies. However, this metric is inherently retrospective: it reflects only the past 90 days of activity and provides no insight into future maintenance, which limits its usefulness for proactive risk assessment. In this paper, we study to what extent future maintenance activity, as captured by the OpenSSF Maintained score, can be forecast. We analyze 3,220 GitHub repositories associated with the top 1% of PyPI libraries ranked by PageRank centrality and reconstruct their historical Maintained scores over a three-year period. We formulate the task as multivariate time series forecasting and consider four target representations: raw scores, bucketed maintenance levels, numerical trend slopes, and categorical trend types. We compare a statistical model (VARMA), a machine learning model (Random Forest), and a deep learning model (LSTM) across training windows of 3-12 months and forecasting horizons of 1-6 months. Our results show that future maintenance activity can be predicted with meaningful accuracy, particularly for aggregated representations such as bucketed scores and trend types, for which accuracies exceed 0.95 and 0.80, respectively. Simpler statistical and machine learning models perform on par with the deep learning approach, indicating that complex architectures are not required. These findings suggest that predictive modeling can effectively complement existing Scorecard metrics, enabling more proactive assessment of open-source maintenance risks.
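
To make the four target representations concrete, the following minimal sketch derives each of them from a single monthly series of Maintained scores. It is an illustration under stated assumptions, not the pipeline used in the study: the bucket edges (3 and 7), the slope tolerance (0.1), the trend labels, and all function names are hypothetical, and the example is univariate, whereas the study formulates the task as multivariate forecasting.

```python
import numpy as np

def make_windows(series, train_len, horizon):
    """Slide over a score series and return (history, future) pairs."""
    pairs = []
    for start in range(len(series) - train_len - horizon + 1):
        split = start + train_len
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs

def bucket(score, edges=(3.0, 7.0)):
    """Map a score to a level: 0 = low, 1 = medium, 2 = high.
    The edges are assumed for illustration, not taken from the paper."""
    return int(np.digitize(score, edges))

def trend_slope(values):
    """Least-squares slope of the scores over the window (score units per month)."""
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, 1)
    return float(slope)

def trend_type(slope, tol=0.1):
    """Categorical trend from a slope; the tolerance is an assumed value."""
    if slope > tol:
        return "increasing"
    if slope < -tol:
        return "decreasing"
    return "stable"

# Hypothetical monthly scores on the Scorecard's 0-10 scale.
scores = np.array([10, 10, 8, 6, 5, 3, 2, 0, 0], dtype=float)

# One 6-month training window followed by a 3-month forecasting horizon.
history, future = make_windows(scores, train_len=6, horizon=3)[0]

targets = {
    "raw": future.tolist(),                    # raw scores
    "bucket": [bucket(s) for s in future],     # bucketed maintenance levels
    "slope": trend_slope(future),              # numerical trend slope
    "trend": trend_type(trend_slope(future)),  # categorical trend type
}
print(targets)
```

A forecasting model would be trained to predict one of these targets from `history`. The aggregated targets (`bucket`, `trend`) discard small month-to-month fluctuations, which may be one reason they are easier to predict than raw scores.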
