Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way: on only one or two specific tasks, in a few languages, at a partial granularity level (e.g., function), and in many cases without proper training data. Even more concerning, in most cases generated code has been evaluated by mere lexical overlap with a reference solution rather than by actual execution. We introduce xCodeEval, the largest executable multilingual multitask benchmark to date, consisting of $25$M document-level coding examples ($16.5$B tokens) from about $7.5$K unique problems covering up to $11$ programming languages with execution-level parallelism. It features a total of $7$ tasks involving code understanding, generation, translation, and retrieval. xCodeEval adopts execution-based evaluation and offers a multilingual code execution engine, ExecEval, which supports unit-test-based execution in all $11$ languages. To address the challenge of balancing the distributions of text-code samples over multiple attributes in the validation/test sets, we propose a novel data splitting and data selection schema based on the geometric mean and graph-theoretic principles. Our experiments with OpenAI's LLMs (zero-shot) and open LLMs (zero-shot and fine-tuned) across the tasks and languages show that xCodeEval remains quite challenging for current language models.
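To make the contrast between lexical overlap and execution-based evaluation concrete, here is a minimal sketch in Python. It is not ExecEval's interface. The function names `token_overlap` and `passes_unit_tests`, the sample programs, and the test cases are all hypothetical and chosen for illustration. It runs only Python programs, with unit tests given as (stdin, expected stdout) pairs.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of whitespace tokens: a crude stand-in for lexical metrics."""
    a, b = set(candidate.split()), set(reference.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def passes_unit_tests(source: str, tests: list[tuple[str, str]],
                      timeout: float = 5.0) -> bool:
    """Run `source` as a Python program on each (stdin, expected stdout) pair."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "solution.py"
        path.write_text(source)
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                return False  # treat a timeout as a failed test
            if result.returncode != 0:
                return False  # runtime error
            if result.stdout.strip() != expected.strip():
                return False  # wrong answer
    return True


# Hypothetical task: read two integers and print their sum.
REFERENCE = "a, b = map(int, input().split())\nprint(a + b)\n"
# Almost identical text to the reference, but functionally wrong.
NEAR_COPY = "a, b = map(int, input().split())\nprint(a - b)\n"
# Very different text from the reference, but functionally correct.
REWRITE = (
    "import sys\n"
    "nums = [int(x) for x in sys.stdin.read().split()]\n"
    "print(sum(nums))\n"
)
TESTS = [("1 2\n", "3"), ("10 -4\n", "6")]

if __name__ == "__main__":
    for name, program in [("near copy", NEAR_COPY), ("rewrite", REWRITE)]:
        print(name, token_overlap(program, REFERENCE),
              passes_unit_tests(program, TESTS))
```

In this sketch the near copy scores higher on token overlap yet fails the tests, while the rewrite scores lower yet passes, which is the failure mode of lexical metrics that motivates execution-based evaluation. A real engine would also need sandboxing, resource limits, and per-language compile and run steps, none of which are modeled here.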