AI systems that can generate code to solve problems or assist developers in writing code can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, these models have often been evaluated in a scattered way: on only one or two specific tasks, in a few languages, at a partial level of granularity (e.g., function level), and in many cases without proper training data. Even more concerning, generated code has in most cases been evaluated by mere lexical overlap with a reference solution rather than by actual execution. We introduce xCodeEval, the largest executable multilingual multitask benchmark to date, consisting of 25M document-level coding examples (16.5B tokens) from about 7.5K unique problems covering up to 11 programming languages with execution-level parallelism. It features a total of seven tasks involving code understanding, generation, translation, and retrieval. xCodeEval adopts execution-based evaluation and offers a multilingual code execution engine, ExecEval, that supports unit-test-based execution in all 11 languages. To address the challenge of balancing the distributions of text-code samples over multiple attributes in the validation and test sets, we further propose a novel data splitting and data selection schema based on the geometric mean and graph-theoretic principles. Experimental results on all tasks and languages show that xCodeEval is a promising yet challenging benchmark given the current state of language models.
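To make the contrast between lexical overlap and execution-based evaluation concrete, the following is a minimal Python sketch. It is not ExecEval's API: the function names `passes_unit_tests` and `token_overlap`, the toy "sum two integers" problem, and the Jaccard token overlap used as a stand-in for lexical metrics are all assumptions made for illustration.

```python
import subprocess
import sys


def passes_unit_tests(source: str, tests: list[tuple[str, str]], timeout: float = 5.0) -> bool:
    """Run a Python program on each (stdin, expected stdout) pair.

    Hypothetical helper: returns True only if every test produces the
    expected output with a zero exit code within the time limit.
    """
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of whitespace-split tokens (a crude lexical metric)."""
    a, b = set(candidate.split()), set(reference.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Toy problem (assumed for illustration): read two integers, print their sum.
REFERENCE = "a, b = map(int, input().split())\nprint(a + b)\n"
CORRECT_BUT_DIFFERENT = (
    "import sys\n"
    "nums = [int(t) for t in sys.stdin.read().split()]\n"
    "print(sum(nums))\n"
)
SIMILAR_BUT_WRONG = "a, b = map(int, input().split())\nprint(a - b)\n"
TESTS = [("1 2\n", "3"), ("10 -4\n", "6")]

if __name__ == "__main__":
    for name, code in [
        ("correct but different", CORRECT_BUT_DIFFERENT),
        ("similar but wrong", SIMILAR_BUT_WRONG),
    ]:
        overlap = token_overlap(code, REFERENCE)
        verdict = passes_unit_tests(code, TESTS)
        print(f"{name}: overlap={overlap:.2f}, passes tests={verdict}")
```

In this sketch the wrong program shares almost all of its tokens with the reference yet fails the unit tests, while the correct program shares almost none yet passes. A production engine would additionally need sandboxing, resource limits, and per-language compilers or runtimes, none of which are modeled here.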