Code completion is a key feature of Integrated Development Environments (IDEs): it predicts the next tokens a developer is likely to write, helping them write code faster and with less effort. Modern code completion approaches are often powered by deep learning (DL) models. However, the rapid evolution of programming languages poses a critical challenge to the performance of DL-based code completion models: can these models generalize across different language versions? This paper investigates that question. In particular, we assess the ability of a state-of-the-art model, CodeT5, to generalize across nine Java versions, ranging from Java 2 to Java 17, when trained exclusively on Java 8 code. Our evaluation spans three completion scenarios: predicting tokens, constructs (e.g., the condition of an if statement), and entire code blocks. The results reveal a noticeable performance disparity among language versions, with the worst performance obtained on Java 2 and Java 17, the versions furthest from Java 8. We investigate possible causes of the performance degradation and show that a limited amount of version-specific fine-tuning can partially alleviate the problem. Our work raises awareness of the importance of continuous model refinement, and it can inform the design of alternatives that make code completion models more robust to language evolution.