Self-Admitted Technical Debt (SATD) refers to cases in which developers use textual artifacts to explain why an existing implementation is not optimal. Prior work on SATD detection has focused either on identifying SATD (classifying an item as SATD or not) or on categorizing SATD (labeling SATD instances by type, such as requirement, design, code, or test debt). However, the performance of these approaches remains suboptimal, particularly for certain types such as test and requirement debt, primarily because the datasets are extremely imbalanced. To address these challenges, we build on earlier research by using a BiLSTM architecture for the binary identification of SATD and a BERT architecture for categorizing SATD types. Although both architectures are effective, they struggle with imbalanced data. We therefore employ a data augmentation strategy based on a large language model to mitigate this issue. Furthermore, we introduce a two-step approach to identify and categorize SATD across datasets derived from different artifacts. Our contributions include a balanced dataset for future SATD research and evidence that our approach significantly improves SATD identification and categorization performance compared to baseline methods.
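The control flow of the two-step approach can be illustrated with a minimal sketch, assuming that step one is the binary identification and that step two categorizes only the items flagged as SATD. The keyword rules below are hypothetical stand-ins chosen so that the example runs without trained models; they are not the BiLSTM or BERT models described above, and the names `identify_satd`, `categorize_satd`, and `two_step` are illustrative only.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for the trained models. Step 1 would be a BiLSTM
# and step 2 a BERT classifier; keyword rules are used here only so the
# sketch is runnable on its own.
SATD_MARKERS = ("todo", "fixme", "hack", "workaround", "temporary")
CATEGORY_KEYWORDS = {
    "test": ("test", "coverage"),
    "requirement": ("not implemented", "not supported", "missing feature"),
    "design": ("refactor", "hack", "workaround"),
}


def identify_satd(text: str) -> bool:
    """Step 1 (assumed): binary decision, SATD or not."""
    lowered = text.lower()
    return any(marker in lowered for marker in SATD_MARKERS)


def categorize_satd(text: str) -> str:
    """Step 2 (assumed): assign a debt type to an item already flagged as SATD."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "code"  # fallback category when no keyword matches


def two_step(
    text: str,
    identifier: Callable[[str], bool] = identify_satd,
    categorizer: Callable[[str], str] = categorize_satd,
) -> Optional[str]:
    """Run identification first; categorize only items flagged as SATD."""
    if not identifier(text):
        return None
    return categorizer(text)


if __name__ == "__main__":
    for comment in (
        "TODO: add unit test for parser",
        "Compute the checksum of the buffer",
    ):
        print(comment, "->", two_step(comment))
```

Passing the two classifiers as parameters keeps the pipeline independent of the underlying models, so a trained identifier and categorizer could be substituted for the keyword rules without changing `two_step`.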