The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems, or assist developers in writing programs, can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way: on only one or two specific tasks, in a few languages, at a partial (e.g., function-level) granularity, and in many cases without proper training data. Even more concerning is that in most cases generated code has been evaluated in terms of mere lexical overlap rather than actual execution, whereas the semantic similarity (or equivalence) of two code segments depends only on their ``execution similarity'', i.e., whether they produce the same output for a given input.
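To make the contrast between lexical overlap and execution similarity concrete, the following is a minimal sketch and not the evaluation protocol of any particular benchmark. The helper names (`token_overlap`, `execution_agreement`, `load`), the Jaccard token metric, and the three toy programs are all assumptions introduced here for illustration.

```python
from typing import Callable, Iterable


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (a crude lexical metric)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def execution_agreement(f: Callable, g: Callable, inputs: Iterable) -> float:
    """Fraction of test inputs on which f and g return the same output."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one test input")
    same = sum(1 for x in inputs if f(x) == g(x))
    return same / len(inputs)


# Hypothetical programs: all are meant to compute 0 + 1 + ... + n.
SRC_LOOP = "def total(n):\n    s = 0\n    for i in range(n + 1):\n        s += i\n    return s"
SRC_FORMULA = "def total(n):\n    return n * (n + 1) // 2"
# Off-by-one bug: the loop stops at n - 1.
SRC_BUGGY = "def total(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s"


def load(src: str) -> Callable:
    """Compile a trusted source string and return its `total` function."""
    namespace: dict = {}
    exec(src, namespace)  # only acceptable for trusted, illustrative code
    return namespace["total"]


if __name__ == "__main__":
    tests = range(20)
    for name, src in [("formula", SRC_FORMULA), ("buggy", SRC_BUGGY)]:
        lexical = token_overlap(SRC_LOOP, src)
        semantic = execution_agreement(load(SRC_LOOP), load(src), tests)
        print(f"{name}: lexical={lexical:.2f} execution={semantic:.2f}")
```

In this toy setting the buggy variant is lexically much closer to the reference loop than the closed-form variant is, yet only the closed-form variant agrees with the reference on every test input. Real execution-based evaluation would additionally need sandboxing, timeouts, and a carefully chosen set of test cases.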