Recently, large language models (LLMs) have shown strong potential in code generation tasks. However, gaps remain before they can be fully applied in real software development processes. Accurately assessing the code generation capabilities of LLMs has therefore become an important basis for evaluating and improving these models. Existing works have constructed datasets to evaluate these capabilities. However, the current evaluation process may suffer from the illusion of "Specialist in Familiarity", primarily due to three gaps: exposure of the target code, case timeliness, and dependency availability. The root cause of these gaps is that the code in current datasets may have been extensively exposed and exercised during the training phase, and, given the continuous training and development of LLMs, its timeliness has been severely compromised. The key to solving this problem is to evaluate LLMs, as much as possible, using code they have not encountered before. Thus, the fundamental idea of this paper is to draw on the concept of code obfuscation, transforming code at different levels while preserving its functionality and output. To this end, we build a code-obfuscation-based benchmark, OBFUSEVAL. We first collect 1,354 raw cases from five real-world projects, each including a function description and code. We then apply a three-level strategy (symbol, structure, and semantic) to obfuscate descriptions, code, and context dependencies. We evaluate four LLMs on OBFUSEVAL and compare the effectiveness of the different obfuscation strategies, using the official test suites of these projects to assess the generated code. The results show that after obfuscation, the average decrease in test pass rate can reach up to 62.5%.
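The symbol-level obfuscation idea can be illustrated with a minimal sketch: rename identifiers to opaque names while keeping functionality and output unchanged. This is our own illustrative example (the `SymbolObfuscator` class, the `clamp` function, and the name mapping are assumptions for demonstration, not the paper's actual implementation):

```python
import ast


class SymbolObfuscator(ast.NodeTransformer):
    """Rename identifiers to opaque names while preserving behavior."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename variable references that appear in the mapping.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Rename function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        if node.name in self.mapping:
            node.name = self.mapping[node.name]
        return node


original = """
def clamp(value, low, high):
    result = max(low, min(value, high))
    return result
"""

mapping = {"clamp": "f0", "value": "v0", "low": "v1", "high": "v2", "result": "v3"}
tree = SymbolObfuscator(mapping).visit(ast.parse(original))
obfuscated = ast.unparse(tree)  # requires Python 3.9+

# Both versions must produce identical outputs on the same inputs.
ns_a, ns_b = {}, {}
exec(original, ns_a)
exec(obfuscated, ns_b)
assert ns_a["clamp"](15, 0, 10) == ns_b["f0"](15, 0, 10) == 10
```

The structure- and semantic-level strategies transform code more deeply (e.g., control-flow and logic rewrites), but the invariant is the same: the obfuscated case must pass the same test suite as the original.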