Technical debt refers to taking shortcuts to achieve short-term goals while sacrificing the long-term maintainability and evolvability of software systems. A large part of technical debt is explicitly reported by the developers themselves; this is commonly referred to as Self-Admitted Technical Debt or SATD. Previous work has focused on identifying SATD from source code comments and issue trackers. However, no approach is available for automatically identifying SATD from other sources such as commit messages and pull requests, or by combining multiple sources. Therefore, we propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems. Our findings show that our approach outperforms baseline approaches and achieves an average F1-score of 0.611 when detecting four types of SATD (i.e., code/design debt, requirement debt, documentation debt, and test debt) from the four aforementioned sources. We then analyze 23.6M code comments, 1.3M commit messages, 3.7M issue sections, and 1.7M pull request sections to characterize SATD in 103 open-source projects. Furthermore, we investigate the SATD keywords and the relations between SATD in different sources. The findings indicate, among others, that: 1) SATD is evenly spread among all sources; 2) issues and pull requests are the two most similar sources regarding the number of shared SATD keywords, followed by commit messages and then code comments; 3) there are four kinds of relations between SATD items in the different sources.
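As a reading aid for the reported average F1-score, the following minimal sketch computes a per-type F1-score and an unweighted (macro) average over the four SATD types. The abstract does not state how the average is computed, so the macro average is an assumption, and the names `Counts`, `f1_score`, and `macro_f1` as well as all counts are hypothetical and not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one SATD type (hypothetical illustration)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f1_score(c: Counts) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def macro_f1(per_type: dict[str, Counts]) -> float:
    """Unweighted mean of per-type F1-scores (assumed averaging scheme)."""
    return sum(f1_score(c) for c in per_type.values()) / len(per_type)


# Made-up counts for the four SATD types named in the abstract.
example = {
    "code/design debt": Counts(tp=80, fp=20, fn=20),
    "requirement debt": Counts(tp=30, fp=20, fn=10),
    "documentation debt": Counts(tp=20, fp=10, fn=10),
    "test debt": Counts(tp=10, fp=10, fn=10),
}

if __name__ == "__main__":
    for name, counts in example.items():
        print(f"{name}: F1 = {f1_score(counts):.3f}")
    print(f"macro-average F1 = {macro_f1(example):.3f}")
```

A macro average weights each debt type equally regardless of how many items it contains; a micro or support-weighted average would give a different figure when the types are imbalanced.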