Self-Admitted Technical Debt (SATD), a concept describing sub-optimal choices in software development that are documented in code comments or other project resources, poses challenges to the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks, especially software text generation tasks. Nonetheless, their effectiveness in tasks related to SATD is still under-researched. In this paper, we investigate the efficacy of LLMs in both the identification and classification of SATD. For both tasks, we examine the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results show that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline, the CNN model, improving the F1 score by 4.4% to 7.2%. In the SATD classification task, while our largest fine-tuned model, Flan-T5-XL, still leads in performance, the CNN model exhibits competitive results, even surpassing four of the six LLMs. We also find that the largest Flan-T5 model, Flan-T5-XXL, when used with a zero-shot in-context learning (ICL) approach for SATD identification, achieves results competitive with traditional approaches but performs 6.4% to 9.2% worse than the fine-tuned LLMs. For SATD classification, the few-shot ICL approach, which incorporates examples and category descriptions in the prompts, outperforms the zero-shot approach and even surpasses the smaller fine-tuned Flan-T5 models. Moreover, our experiments demonstrate that incorporating contextual information, such as surrounding code, into the SATD classification task enables the larger fine-tuned LLMs to further improve their performance.
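To make the few-shot ICL setup concrete, the minimal sketch below shows one way such a prompt can be assembled for SATD classification with a Flan-T5 model via the Hugging Face transformers API. The prompt wording, category descriptions, and demonstrations are illustrative assumptions based on the common SATD taxonomy (design, requirement, documentation, test debt), not the exact prompts used in our experiments; `google/flan-t5-base` stands in for the larger models we study.

```python
# Illustrative few-shot in-context learning (ICL) for SATD classification
# with Flan-T5. Prompt text and examples are assumptions for demonstration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "google/flan-t5-base"  # smaller sibling of the XL/XXL models
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Category descriptions embedded directly in the prompt.
PROMPT_HEADER = (
    "Classify the technical-debt comment into one category.\n"
    "Categories:\n"
    "- design: sub-optimal design or implementation choices\n"
    "- requirement: incomplete or deferred requirements\n"
    "- documentation: missing or outdated documentation\n"
    "- test: inadequate or missing tests\n\n"
)

# A few labeled demonstrations (the few-shot examples).
FEW_SHOT = (
    "Comment: TODO: this hack works but should be refactored properly\n"
    "Category: design\n\n"
    "Comment: FIXME: edge case for empty input not handled yet\n"
    "Category: requirement\n\n"
)

def classify_satd(comment: str) -> str:
    """Build the few-shot prompt and let the model emit a category label."""
    prompt = PROMPT_HEADER + FEW_SHOT + f"Comment: {comment}\nCategory:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=8)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

print(classify_satd("TODO: add unit tests for the parser"))  # expected: test
```

The same scaffold degrades to the zero-shot setting by dropping the FEW_SHOT demonstrations, and contextual information such as surrounding code can be appended to the comment before classification.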