Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study

Transformer-based pre-trained models have recently achieved strong results on many software engineering tasks, including automatic code completion, a staple of the developer's toolkit. While many have striven to improve the code-understanding abilities of such models, the opposite approach, making the code itself easier to understand, has not been properly investigated. In this study, we ask whether making code easier to understand by adding contextual data improves the performance of pre-trained code language models on the task of code completion. We consider type annotations and comments as two common forms of additional contextual information that often help developers understand code better. In our experiments, we study code completion at two levels of granularity, token completion and line completion, using three recent large-scale language models for source code (UniXcoder, CodeGPT, and InCoder) and five evaluation metrics. Finally, we perform the Wilcoxon signed-rank test to gauge significance and measure the effect size. Contrary to our expectations, all models perform better when type annotations are removed, although the effect sizes are small. For comments, we find that the models perform better in the presence of multi-line comments, again with small effect sizes. Based on our observations, we recommend making deliberate design choices when training, fine-tuning, or simply selecting such models, given the intended data and application. Better evaluation methods and multi-modal techniques can also be investigated further to improve the practicality and accuracy of auto-completions.
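To make the "type annotations removed" condition concrete, the following is a minimal sketch of how an annotated snippet could be turned into its unannotated counterpart. It is not the study's preprocessing pipeline: the abstract does not name the programming language or tooling used, so Python and its standard `ast` module are chosen here purely for illustration, and the names `AnnotationStripper` and `strip_type_annotations` are hypothetical.

```python
import ast


class AnnotationStripper(ast.NodeTransformer):
    """Illustrative only: removes type annotations from a Python AST."""

    def visit_FunctionDef(self, node):
        # Visit children first so that argument annotations are stripped too.
        self.generic_visit(node)
        node.returns = None  # drop the "-> T" return annotation
        return node

    # Async functions carry annotations in the same fields.
    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        node.annotation = None  # drop the ": T" on a parameter
        return node

    def visit_AnnAssign(self, node):
        self.generic_visit(node)
        if node.value is None:
            # A bare declaration such as "x: int" has no untyped equivalent;
            # replace it with "pass" so the enclosing block stays valid.
            return ast.copy_location(ast.Pass(), node)
        # "x: int = 1" becomes "x = 1".
        plain = ast.Assign(targets=[node.target], value=node.value, type_comment=None)
        return ast.copy_location(plain, node)


def strip_type_annotations(source: str) -> str:
    """Return `source` with type annotations removed (requires Python 3.9+)."""
    tree = AnnotationStripper().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    typed = (
        "def add(a: int, b: int = 2) -> int:\n"
        "    total: int = a + b\n"
        "    return total\n"
    )
    print(strip_type_annotations(typed))
```

Note that `ast.unparse` discards comments and normalizes formatting, so this sketch would confound the two variables the study treats separately (type annotations and comments). A faithful experimental setup would need a transformation that leaves comments and layout untouched.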
