The Role of Data Filtering in Open Source Software Ranking and Selection

Faced with over 100M open source projects, most empirical investigations select a subset. Most research papers in leading venues filter projects by some measure of popularity, with explicit or implicit arguments that unpopular projects are not of interest, may not even represent "real" software projects, or are not worthy of study. However, such filtering may have enormous effects on the results of a study precisely when the response or prediction being sought is in any way related to the filtering criteria. We illustrate the impact of this practice on research outcomes by examining how filtering of projects listed on GitHub affects the assessment of their popularity. We randomly sample over 100,000 repositories and use multiple regression to model the number of stars (a proxy for popularity) as a function of the number of commits, the duration of the project, the number of authors, and the number of core developers. Comparing a control model fitted to the entire dataset with a model fitted only to projects having ten or more authors, we find that while certain characteristics of the repository consistently predict popularity, the filtering process significantly alters the relationships between these characteristics and the response. The number of commits exhibited a positive correlation with popularity in the control sample but a negative correlation in the filtered sample. These findings highlight the potential biases introduced by data filtering and emphasize the need for careful sample selection in empirical research that mines software repositories. We recommend that empirical work either analyze complete datasets such as World of Code, or employ stratified random sampling from a complete dataset, to ensure that filtering does not bias the results.
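
As a rough illustration of the comparison described above, the following Python sketch fits the same linear model once on a full sample and once on the subset of projects with ten or more authors, so that the two sets of coefficients can be placed side by side. The field names (`commits`, `duration`, `authors`, `core`, `stars`), the helper functions, and the synthetic records in the demo are assumptions made for illustration only. The exact model specification (for instance, any transformation of the variables) is not stated here, and the small normal-equation solver merely stands in for whatever regression software was actually used.

```python
import random

PREDICTORS = ("commits", "duration", "authors", "core")


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(repos, response="stars", predictors=PREDICTORS):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    rows = [[1.0] + [float(r[p]) for p in predictors] for r in repos]
    y = [float(r[response]) for r in repos]
    k = len(predictors) + 1
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return dict(zip(("intercept",) + tuple(predictors), beta))


def compare_full_vs_filtered(repos, min_authors=10):
    """Fit the same model on all repos and on those passing the author filter."""
    full = fit_ols(repos)
    filtered = fit_ols([r for r in repos if r["authors"] >= min_authors])
    return full, filtered


if __name__ == "__main__":
    rng = random.Random(0)  # synthetic records, purely illustrative
    repos = []
    for _ in range(500):
        authors = rng.randint(1, 30)
        commits = authors * rng.randint(1, 40)
        repos.append({
            "commits": commits,
            "duration": rng.randint(1, 60),
            "authors": authors,
            "core": rng.randint(1, min(authors, 5)),
            "stars": rng.randint(0, 50) + commits // authors,
        })
    full, filtered = compare_full_vs_filtered(repos)
    for name in full:
        print(f"{name:>9}: full={full[name]:+.3f}  filtered={filtered[name]:+.3f}")
```

With real data, `repos` would be a list of per-repository records drawn at random from a complete collection. A change in the sign of the `commits` coefficient between the two fits would correspond to the reversal reported above. The synthetic demo is not meant to reproduce that result.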
