Language Models for Code Completion: A Practical Evaluation

Transformer-based language models for automatic code completion have shown great promise, yet they are rarely evaluated on real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. The models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind poor model performance. We also compared the models' performance in online and offline settings, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers use code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the models' predictions, we found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.

