Transformer-based language models for automatic code completion have shown great promise, yet they are rarely evaluated on real usage data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. The models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind poor model performance. We also compared the models' performance in online and offline settings, using synthetic benchmark datasets and two masking strategies. Our findings suggest that while developers use code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. In our qualitative analysis of the models' predictions, we found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.
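The abstract does not name the six metrics used. As an illustration only, the sketch below shows two metrics that are common in code completion evaluation, exact match and Levenshtein-based edit similarity; both the choice of metrics and the function names (`levenshtein`, `exact_match`, `edit_similarity`) are assumptions made for this example, not details taken from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the completion equals what the developer ended up with.

    Surrounding whitespace is ignored (an assumption of this sketch).
    """
    return prediction.strip() == ground_truth.strip()


def edit_similarity(prediction: str, ground_truth: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    p, g = prediction.strip(), ground_truth.strip()
    longest = max(len(p), len(g))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(p, g) / longest
```

In an online setting such as the one described, `ground_truth` would plausibly be the code the developer kept after the completion was offered, whereas in an offline setting it would be the masked span of a benchmark file. That mapping is an interpretation of the abstract, not a documented procedure.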