Predicting long time contributors with knowledge units of programming languages: an empirical study

Predicting potential long-time contributors (LTCs) early allows project maintainers to allocate resources and mentoring effectively to support those contributors' development and retention. Mapping programming language expertise to developers and characterizing projects in terms of how they use programming languages can help identify developers who are more likely to become LTCs. However, prior studies on predicting LTCs do not consider programming language skills. This paper reports an empirical study on the use of knowledge units (KUs) of the Java programming language to predict LTCs. A KU is a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We build a prediction model called KULTC, which leverages KU-based features along five different dimensions. We detect and analyze KUs from the 75 studied Java projects (353K commits and 168K pull requests) as well as 4,219 other Java projects in which the studied developers previously worked (1.7M commits). We compare the performance of KULTC with the state-of-the-art model, which we call BAOLTC. Even though KULTC focuses exclusively on the programming language perspective, it achieves a median AUC of at least 0.75 and significantly outperforms BAOLTC. Combining the features of KULTC with the features of BAOLTC results in an enhanced model (KULTC+BAOLTC) that significantly outperforms BAOLTC, with a normalized AUC improvement of 16.5%. Our feature importance analysis with SHAP reveals that developer expertise in the studied project is the most influential feature dimension for predicting LTCs. Finally, we develop a cost-effective model (KULTC_DEV_EXP+BAOLTC) that significantly outperforms BAOLTC. These encouraging results can be helpful to researchers who wish to further study developers' engagement and retention in FLOSS projects or to build models for predicting LTCs.
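To make the reported metrics concrete, the following is a minimal sketch of how an AUC and a normalized AUC improvement could be computed for a baseline model and a combined model. The abstract does not define the normalization, so the formula used here (the gain divided by the baseline's remaining headroom, 1 - AUC_base) is an assumption, and the function names, labels, and scores are hypothetical toy values rather than data from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    This is the pairwise (Mann-Whitney) formulation; tied scores count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def normalized_auc_improvement(auc_base, auc_new):
    """ASSUMED definition: fraction of the baseline's headroom that is closed."""
    if not 0.0 <= auc_base < 1.0:
        raise ValueError("baseline AUC must be in [0, 1)")
    return (auc_new - auc_base) / (1.0 - auc_base)


# Hypothetical toy data: 1 = became an LTC, 0 = did not.
labels = [1, 0, 1, 0, 1, 0]
baseline_scores = [0.6, 0.5, 0.4, 0.7, 0.8, 0.2]   # stand-in for a baseline model
combined_scores = [0.7, 0.3, 0.6, 0.65, 0.9, 0.1]  # stand-in for a combined model

auc_base = auc(labels, baseline_scores)
auc_new = auc(labels, combined_scores)
print(f"baseline AUC: {auc_base:.3f}")
print(f"combined AUC: {auc_new:.3f}")
print(f"normalized improvement: {normalized_auc_improvement(auc_base, auc_new):.1%}")
```

Normalizing by the remaining headroom, under this assumed definition, rewards the same absolute gain more when the baseline is already strong, which is one reason such a measure may be preferred over a raw AUC difference.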
