Self-Admitted Technical Debt (SATD), a concept describing sub-optimal choices in software development that are documented in code comments or other project resources, poses challenges to the maintainability and evolution of software systems. Large language models (LLMs) have demonstrated significant effectiveness across a broad range of software tasks, especially software text generation tasks. Nonetheless, their effectiveness in tasks related to SATD is still under-researched. In this paper, we investigate the efficacy of LLMs in both the identification and classification of SATD. For both tasks, we examine the performance gain from using more recent LLMs, specifically the Flan-T5 family, across different common usage settings. Our results show that for SATD identification, all fine-tuned LLMs outperform the best existing non-LLM baseline, the CNN model, improving the F1 score by 4.4% to 7.2%. In the SATD classification task, while our largest fine-tuned model, Flan-T5-XL, still leads in performance, the CNN model exhibits competitive results, even surpassing four of the six LLMs. We also find that the largest Flan-T5 model, Flan-T5-XXL, when used with a zero-shot in-context learning (ICL) approach for SATD identification, achieves results competitive with traditional approaches but performs 6.4% to 9.2% worse than the fine-tuned LLMs. For SATD classification, the few-shot ICL approach, which incorporates examples and category descriptions in the prompts, outperforms the zero-shot approach and even surpasses the smaller fine-tuned Flan-T5 models. Moreover, our experiments demonstrate that incorporating contextual information, such as surrounding code, into the SATD classification task enables the larger fine-tuned LLMs to further improve their performance.
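To make the few-shot ICL setup concrete, the minimal sketch below shows one way such a prompt can be assembled for SATD classification with a Flan-T5 model via the Hugging Face transformers API. The prompt wording, category descriptions, and demonstrations are illustrative assumptions based on the common SATD taxonomy (design, requirement, documentation, test debt), not the exact prompts used in our experiments; `google/flan-t5-base` stands in for the larger models we study.

```python
# Illustrative few-shot in-context learning (ICL) for SATD classification
# with Flan-T5. Prompt text and examples are assumptions for demonstration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "google/flan-t5-base"  # smaller sibling of the XL/XXL models
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Category descriptions embedded directly in the prompt.
PROMPT_HEADER = (
    "Classify the technical-debt comment into one category.\n"
    "Categories:\n"
    "- design: sub-optimal design or implementation choices\n"
    "- requirement: incomplete or deferred requirements\n"
    "- documentation: missing or outdated documentation\n"
    "- test: inadequate or missing tests\n\n"
)

# A few labeled demonstrations (the few-shot examples).
FEW_SHOT = (
    "Comment: TODO: this hack works but should be refactored properly\n"
    "Category: design\n\n"
    "Comment: FIXME: edge case for empty input not handled yet\n"
    "Category: requirement\n\n"
)

def classify_satd(comment: str) -> str:
    """Build the few-shot prompt and let the model emit a category label."""
    prompt = PROMPT_HEADER + FEW_SHOT + f"Comment: {comment}\nCategory:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=8)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

print(classify_satd("TODO: add unit tests for the parser"))  # expected: test
```

The same scaffold degrades to the zero-shot setting by dropping the FEW_SHOT demonstrations, and contextual information such as surrounding code can be appended to the comment before classification.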