In recent years, large pre-trained Language Models of Code (CodeLMs) have shown promising results on various software engineering tasks. One such task is automatic code update recommendation, which transforms outdated code snippets into their approved and revised counterparts. Although many CodeLM-based approaches have been proposed and report high accuracy, their effectiveness and reliability on real-world code update tasks remain questionable. In this paper, we present the first extensive evaluation of state-of-the-art CodeLMs for automatically recommending code updates. We assess their performance on two diverse datasets of paired updated methods, considering factors such as temporal evolution, project specificity, method size, and update complexity. Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios and generalize poorly to new projects. Moreover, CodeLM performance decreases significantly for larger methods and more complex updates. We also observe that many CodeLM-generated "updates" are actually null, especially in time-wise settings, and that generating meaningful edits remains challenging. Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation and emphasize the need for more research on improving their practicality, robustness, and generalizability.
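The contrast between time-ignoring and time-wise evaluation can be illustrated with a minimal sketch. It is not the evaluation pipeline used in this paper: the `UpdatePair` record, its `timestamp` field, the 80/20 ratio, and all function names are hypothetical. It also assumes that a "null" update means the model returns its input unchanged up to whitespace.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdatePair:
    """One (outdated method, updated method) pair; all fields are illustrative."""
    before: str
    after: str
    timestamp: int  # assumed: commit time, e.g. seconds since the epoch


def random_split(pairs, train_ratio=0.8, seed=0):
    """Time-ignoring split: shuffling lets training pairs postdate test pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def time_wise_split(pairs, train_ratio=0.8):
    """Time-wise split: every training pair is no newer than any test pair."""
    ordered = sorted(pairs, key=lambda p: p.timestamp)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def null_update_rate(inputs, predictions):
    """Fraction of predictions equal to their input after whitespace normalization.

    Assumption: this is what "null update" means; the paper may define it differently.
    """
    if not inputs:
        return 0.0
    unchanged = sum(
        " ".join(src.split()) == " ".join(pred.split())
        for src, pred in zip(inputs, predictions)
    )
    return unchanged / len(inputs)
```

Under the random split a model can be trained on updates written after the ones it is tested on, which is one plausible reason why time-ignoring results look more optimistic than time-wise ones.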