PRESTI: Predicting Repayment Effort of Self-Admitted Technical Debt Using Textual Information

Technical debt refers to the consequences of sub-optimal decisions made during software development that prioritize short-term benefits over long-term maintainability. Self-Admitted Technical Debt (SATD) is a specific form of technical debt that developers explicitly document in software artifacts such as source code comments and commit messages. Because SATD can hinder software development and maintenance, estimating the effort required to repay it is crucial for prioritizing it effectively. However, SATD repayment is currently poorly understood and, more importantly, we lack approaches that can automatically estimate the repayment effort of SATD from its textual description. To bridge this gap, we curate a comprehensive dataset of 341,740 SATD items from 2,568,728 commits across 1,060 Apache repositories and analyze repayment effort, comparing SATD with non-SATD items as well as different types of SATD items with one another. Furthermore, we propose PRESTI, an approach for Predicting Repayment Effort of SATD using Textual Information. Our findings show that different types of SATD require varying levels of repayment effort: code/design, requirement, and test debt demand greater effort than non-SATD items, while documentation debt requires less. Our evaluation shows that the proposed approaches, particularly the BERT- and TextCNN-based models, outperform traditional machine learning methods and the baseline in estimating repayment effort. Additionally, we summarize keywords associated with varying levels of repayment effort during SATD repayment. Our work aims to improve SATD repayment prioritization and resource allocation, and thereby software development and maintainability.
