Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs

Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, enabling applications across software engineering such as requirements engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess how well LLMs understand code, especially when subtle inconsistencies arise between code and its natural language description. In this paper, we propose a novel LLM testing method, Mutation-based Consistency Testing (MCT), which systematically assesses the code understanding performance of LLMs by introducing code mutations into existing code generation datasets. A code mutation is a small change that alters the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs, and then use these pairs to test whether LLMs correctly detect the inconsistencies. We conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark HumanEval-X, which covers six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across mutation types and programming languages and analyze the results. We find that the LLMs vary significantly in their code understanding performance and have different strengths and weaknesses depending on the mutation type and language.
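To make the idea of a code mutation concrete, the following is a minimal Python sketch of the two mutation types named above, operator replacement and statement deletion, applied to toy functions. It is an illustration only, not the tooling used in the study: the function names (`mutate_operator`, `delete_statement`, `build_pairs`), the toy descriptions, and the label strings are all hypothetical, and the sketch assumes Python 3.9 or later for `ast.unparse`.

```python
import ast


class OperatorReplacer(ast.NodeTransformer):
    """Replace one `+` with `-` (a simple operator-replacement mutation)."""

    def __init__(self):
        self.done = False  # mutate only a single operator per run

    def visit_BinOp(self, node):
        self.generic_visit(node)  # visit nested expressions first
        if not self.done and isinstance(node.op, ast.Add):
            node.op = ast.Sub()
            self.done = True
        return node


def mutate_operator(source: str) -> str:
    """Return `source` with one `+` replaced by `-`."""
    tree = OperatorReplacer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def delete_statement(source: str, index: int) -> str:
    """Return `source` with one statement removed from its first function.

    Assumption: the first top-level node of `source` is a function definition.
    """
    tree = ast.parse(source)
    body = tree.body[0].body
    del body[index]
    if not body:  # keep the function syntactically valid
        body.append(ast.Pass())
    return ast.unparse(tree)


def build_pairs(description: str, original: str, mutants: list) -> list:
    """Pair one description with consistent and inconsistent code."""
    pairs = [(description, original, "consistent")]
    pairs += [(description, m, "inconsistent") for m in mutants]
    return pairs


# Toy inputs (hypothetical; not taken from HumanEval-X).
ADD_DESC = "Return the sum of a and b."
ADD_CODE = "def add(a, b):\n    return a + b\n"
CLAMP_CODE = "def clamp(x):\n    if x < 0:\n        x = 0\n    return x\n"

if __name__ == "__main__":
    for desc, code, label in build_pairs(ADD_DESC, ADD_CODE, [mutate_operator(ADD_CODE)]):
        print(f"[{label}] {desc}\n{code}\n")
    print(delete_statement(CLAMP_CODE, 0))
```

Each resulting triple could then be presented to a model together with a question asking whether the code matches the description, with the label serving as the expected answer. How the prompts are actually phrased and which mutation operators are used in the study are not specified in this abstract.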
