Technical debt refers to taking shortcuts to achieve short-term goals while sacrificing the long-term maintainability and evolvability of software systems. A large part of technical debt is explicitly reported by the developers themselves; this is commonly referred to as Self-Admitted Technical Debt (SATD). Previous work has focused on identifying SATD from source code comments and issue trackers. However, no approaches are available for automatically identifying SATD from other sources, such as commit messages and pull requests, or from a combination of multiple sources. Therefore, we propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems. Our findings show that our approach outperforms baseline approaches and achieves an average F1-score of 0.611 when detecting four types of SATD (i.e., code/design debt, requirement debt, documentation debt, and test debt) from the four aforementioned sources. Thereafter, we analyze 23.6M code comments, 1.3M commit messages, 3.7M issue sections, and 1.7M pull request sections to characterize SATD in 103 open-source projects. Furthermore, we investigate the SATD keywords and the relations between SATD in different sources. The findings indicate, among other things, that: 1) SATD is evenly spread among all sources; 2) issues and pull requests are the two most similar sources regarding the number of shared SATD keywords, followed by commit messages and then code comments; 3) there are four kinds of relations between SATD items in the different sources.