The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems, or assist developers in writing programs, can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, these models have often been evaluated in a scattered way: on only one or two specific tasks, in a few languages, at a partial granularity level (e.g., function), and in many cases without proper training data. Even more concerning, in most cases generated code has been evaluated in terms of mere lexical overlap rather than actual execution, even though the semantic similarity (or equivalence) of two code segments depends only on their ``execution similarity'', i.e., producing the same output for a given input.
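To make the contrast between lexical overlap and execution similarity concrete, here is a minimal sketch in Python. It is illustrative only and is not the evaluation method of any particular benchmark. The helper names `token_jaccard` and `execution_match`, the three toy programs, and the choice of Jaccard similarity over token sets as the lexical measure are all assumptions introduced for this example. The snippets are run with `exec`, which is acceptable only for trusted toy code; a real evaluation would need sandboxing.

```python
import re


def token_jaccard(a: str, b: str) -> float:
    """Lexical overlap (hypothetical measure): Jaccard similarity of token sets."""
    def tokenize(s: str) -> set:
        # Identifiers, integer literals, and single non-space symbols.
        return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", s))

    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb)


def execution_match(src_a: str, src_b: str, func_name: str, inputs) -> bool:
    """Execution similarity (hypothetical check): same output on every given input."""
    def load(src: str):
        namespace = {}
        exec(src, namespace)  # trusted toy snippets only; not sandboxed
        return namespace[func_name]

    fa, fb = load(src_a), load(src_b)
    return all(fa(x) == fb(x) for x in inputs)


# Reference solution: sum of 1..n with a loop.
LOOP_SUM = """
def total(n):
    acc = 0
    for i in range(1, n + 1):
        acc += i
    return acc
"""

# Correct alternative that looks very different: closed-form formula.
FORMULA_SUM = """
def total(n):
    return n * (n + 1) // 2
"""

# Incorrect alternative that looks almost identical: off-by-one loop bound.
OFF_BY_ONE = """
def total(n):
    acc = 0
    for i in range(1, n):
        acc += i
    return acc
"""

if __name__ == "__main__":
    test_inputs = range(0, 20)
    for name, candidate in [("formula", FORMULA_SUM), ("off_by_one", OFF_BY_ONE)]:
        overlap = token_jaccard(LOOP_SUM, candidate)
        same = execution_match(LOOP_SUM, candidate, "total", test_inputs)
        print(f"{name}: lexical overlap={overlap:.2f}, execution match={same}")
```

In this toy setting, the off-by-one program scores higher on lexical overlap with the reference than the closed-form program does, yet only the closed-form program agrees with the reference on every test input. Agreement on a finite set of inputs is evidence of equivalence, not proof of it.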