Recently, many Large Code Generation Models (LCGMs) have been proposed, showing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs requires a set of diverse programming problems, each comprising a prompt (including the task description), a canonical solution, and test inputs. Existing methods for constructing such a problem set fall into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, and they risk compromising data integrity, since the problems may already have been included in the data collected to train LCGMs. Perturbation-based approaches mainly generate semantically homogeneous problems that share the same canonical solutions, and they introduce typos that an IDE can easily auto-correct, making them ineffective and unrealistic. In this work, we propose the idea of programming problem merging (PPM) and provide two implementations of this idea. We apply our tool to two widely used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems compared to the baselines.
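To make the problem structure concrete, the following is a minimal sketch of a benchmark problem (prompt, canonical solution, test inputs) and of one way two such problems could be merged. The abstract does not specify how merging works, so the composition strategy here, along with the names `Problem` and `merge`, is an assumption made for illustration and not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Problem:
    prompt: str                      # task description shown to the model
    solution: Callable[[Any], Any]   # canonical solution
    test_inputs: List[Any]           # inputs used to check generated code


def merge(first: Problem, second: Problem) -> Problem:
    """Hypothetical merge: feed the output of `first` into `second`."""
    return Problem(
        prompt=f"{first.prompt} Then, using that result as input: {second.prompt}",
        solution=lambda x: second.solution(first.solution(x)),
        # Inputs of the first problem remain valid for the merged problem.
        test_inputs=list(first.test_inputs),
    )


# Two toy seed problems (illustrative only, not taken from any dataset).
sort_desc = Problem(
    prompt="Sort a list of integers in descending order.",
    solution=lambda xs: sorted(xs, reverse=True),
    test_inputs=[[3, 1, 2], [5, 4]],
)
take_first = Problem(
    prompt="Return the first element of a list.",
    solution=lambda xs: xs[0],
    test_inputs=[[7, 8]],
)

merged = merge(sort_desc, take_first)
# Expected outputs for the merged problem come from its composed solution.
expected = [merged.solution(x) for x in merged.test_inputs]
```

Under this assumed strategy, the merged problem has a canonical solution that differs from both seed solutions, which is the property the abstract contrasts with perturbation-based methods.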