Technical debt refers to the consequences of sub-optimal decisions made during software development that prioritize short-term benefits over long-term maintainability. Self-Admitted Technical Debt (SATD) is a specific form of technical debt that developers explicitly document in software artifacts such as source code comments and commit messages. Because SATD can hinder software development and maintenance, it is crucial to address and prioritize it effectively. However, current approaches cannot automatically estimate the repayment effort of SATD from its textual description. To address this limitation, we propose a novel approach for automatically estimating SATD repayment effort, using a comprehensive dataset of 341,740 SATD items from 2,568,728 commits across 1,060 Apache repositories. Our findings show that different types of SATD require different levels of repayment effort: code/design, requirement, and test debt demand more effort than non-SATD items, while documentation debt requires less. We introduce and evaluate machine learning approaches, particularly BERT and TextCNN, which outperform classic machine learning methods and the naive baseline in estimating repayment effort. Additionally, we summarize keywords associated with varying levels of repayment effort that occur during SATD repayment. Our contributions aim to improve the prioritization of SATD repayment and the efficiency of resource allocation, ultimately benefiting software development and maintainability.
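To make the comparison against a naive baseline concrete, the following is a minimal sketch rather than the method evaluated in this work. It assumes repayment effort is a single numeric value per SATD item, and it substitutes a simple per-token averaging estimator for BERT and TextCNN. The example texts, effort values, and function names (`fit_naive`, `fit_keyword`, `mae`) are hypothetical and are not drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Toy, made-up examples: (SATD text, effort value). NOT data from the study.
train = [
    ("TODO refactor this class to remove duplicated logic", 120.0),
    ("FIXME redesign the scheduler interface", 200.0),
    ("TODO add unit test for parser", 80.0),
    ("TODO fix typo in javadoc", 5.0),
    ("TODO update documentation for config option", 10.0),
]
test = [
    ("TODO refactor the parser interface", 150.0),
    ("TODO fix documentation typo", 8.0),
]


def tokenize(text):
    """Lowercase and split on whitespace (assumed, deliberately simple)."""
    return text.lower().split()


def fit_naive(examples):
    """Naive baseline: always predict the median training effort."""
    center = median(effort for _, effort in examples)
    return lambda text: center


def fit_keyword(examples):
    """Text-based stand-in: average the effort of training items per token,
    then predict the mean of the averages of the tokens seen in the input."""
    per_token = defaultdict(list)
    for text, effort in examples:
        for tok in set(tokenize(text)):
            per_token[tok].append(effort)
    token_avg = {tok: mean(vals) for tok, vals in per_token.items()}
    fallback = mean(effort for _, effort in examples)

    def predict(text):
        known = [token_avg[t] for t in tokenize(text) if t in token_avg]
        return mean(known) if known else fallback

    return predict


def mae(predict, examples):
    """Mean absolute error of a predictor over (text, effort) pairs."""
    return mean(abs(predict(text) - effort) for text, effort in examples)


naive_mae = mae(fit_naive(train), test)
keyword_mae = mae(fit_keyword(train), test)
print(f"naive MAE:   {naive_mae:.2f}")
print(f"keyword MAE: {keyword_mae:.2f}")
```

On this toy data the text-based estimator has a lower error than the constant baseline, which only illustrates the shape of the comparison and says nothing about results on real SATD items.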