Self-Admitted Technical Debt (SATD), cases where developers intentionally acknowledge suboptimal solutions in code through comments, poses a significant challenge to software maintainability. Left unresolved, SATD can degrade code quality and increase maintenance costs. While Large Language Models (LLMs) have shown promise in tasks such as code generation and program repair, their potential for automated SATD repayment remains underexplored. In this paper, we identify three key challenges in training and evaluating LLMs for SATD repayment: (1) dataset representativeness and scalability, (2) removal of irrelevant SATD repayments, and (3) limitations of existing evaluation metrics. To address the first two dataset-related challenges, we adopt a language-independent SATD tracing tool and design a 10-step filtering pipeline to extract SATD repayments from repositories, resulting in two large-scale datasets: 58,722 items for Python and 97,347 items for Java. To improve evaluation, we introduce two diff-based metrics, BLEU-diff and CrystalBLEU-diff, which measure the quality of code changes rather than of entire code snippets. Additionally, we propose a third metric, LEMOD, which is both interpretable and informative. Using our new benchmarks and evaluation metrics, we evaluate two types of automated SATD repayment methods: fine-tuning smaller models and prompt engineering with five large-scale models. Our results reveal that fine-tuned small models achieve Exact Match (EM) scores comparable to prompt-based approaches but underperform on BLEU-based metrics and LEMOD. Notably, Gemma-2-9B leads in EM, addressing 10.1% of Python and 8.1% of Java SATDs, while Llama-3.1-70B-Instruct and GPT-4o-mini excel on BLEU-diff, CrystalBLEU-diff, and LEMOD. Our work contributes a robust benchmark, improved evaluation metrics, and a comprehensive evaluation of LLMs, advancing research on automated SATD repayment.
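The intuition behind a diff-based metric can be sketched as follows: instead of scoring a model's entire output file against the reference, score only the lines that changed between the pre-repayment and post-repayment code. This is a hypothetical illustration, not the paper's exact definition of BLEU-diff; the function names, the use of `difflib`, and the plain n-gram BLEU below are all assumptions made for the sketch.

```python
# Hypothetical sketch of a diff-based BLEU ("BLEU-diff"-style scoring):
# compare only the changed lines between pre- and post-repayment code,
# not the whole snippet. Illustrative only; the paper's definition may differ.
import difflib
import math
from collections import Counter


def diff_lines(before: str, after: str) -> list[str]:
    """Keep only the added/removed lines from a unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    return [
        line[1:]
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def simple_bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Plain sentence-level BLEU over whitespace tokens with a brevity penalty."""
    cand = " ".join(candidate).split()
    ref = " ".join(reference).split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped n-gram matches
        total = max(sum(c_ngrams.values()), 1)
        log_precision += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short candidate diffs.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_precision / max_n)


def bleu_diff(before: str, model_after: str, gold_after: str) -> float:
    """Score the model's code change against the reference code change."""
    return simple_bleu(diff_lines(before, model_after), diff_lines(before, gold_after))
```

A model that leaves the code untouched scores 0 (its diff is empty), while one that reproduces the reference change exactly scores 1, regardless of how much unchanged code surrounds the edit.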