Code Large Language Models (Code LLMs) are increasingly employed in real-life applications, so evaluating them is critical. While the general accuracy of Code LLMs on individual tasks has been extensively evaluated, their self-consistency across different tasks has been overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and general accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed an aspect distinct from general accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool: using it, we identify and demonstrate three major weaknesses in current Code LLMs. Our code is available at https://github.com/marcusm117/IdentityChain.
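The notion of self-consistency described in the abstract can be illustrated with a minimal sketch. This is not the IdentityChain implementation or its API. Every name below (`is_self_consistent`, `behaves_the_same`, the toy "models") is hypothetical, programs are modeled as plain Python callables, and comparing outputs on a few sample inputs is an assumed stand-in for semantic equality. The sketch alternates between generating code from a specification and generating a specification from that code, and checks that the behavior of the code is preserved along the chain.

```python
from typing import Callable, Iterable

# Hypothetical modeling choices for illustration only:
# a "program" is a Python callable, a "specification" is a string.
Program = Callable[[int], int]
NlToPl = Callable[[str], Program]  # specification -> code
PlToNl = Callable[[Program], str]  # code -> specification


def behaves_the_same(p: Program, q: Program, inputs: Iterable[int]) -> bool:
    """Approximate semantic equality by comparing outputs on sample inputs."""
    return all(p(x) == q(x) for x in inputs)


def is_self_consistent(
    spec: str,
    nl_to_pl: NlToPl,
    pl_to_nl: PlToNl,
    inputs: Iterable[int],
    steps: int = 1,
) -> bool:
    """Follow spec -> code -> spec -> code ... and check behavior is preserved."""
    samples = list(inputs)
    program = nl_to_pl(spec)
    for _ in range(steps):
        spec = pl_to_nl(program)       # model describes its own code
        next_program = nl_to_pl(spec)  # model implements its own description
        if not behaves_the_same(program, next_program, samples):
            return False
        program = next_program
    return True


# Toy stand-ins for a model (assumptions, not real Code LLM calls).
def toy_nl_to_pl(spec: str) -> Program:
    table = {
        "double the input": lambda x: 2 * x,
        "add the input to itself": lambda x: x + x,
        "square the input": lambda x: x * x,
    }
    return table[spec]


def faithful_pl_to_nl(program: Program) -> str:
    # Rephrases the behavior without changing its meaning.
    return "add the input to itself"


def lossy_pl_to_nl(program: Program) -> str:
    # Describes the code incorrectly, which breaks the chain.
    return "square the input"


print(is_self_consistent("double the input", toy_nl_to_pl, faithful_pl_to_nl, range(5)))
print(is_self_consistent("double the input", toy_nl_to_pl, lossy_pl_to_nl, range(5)))
```

In this toy setting the first chain preserves behavior, while the second does not, because the generated description no longer matches the code it describes. Note that the check never asks whether the first program is correct for the original specification, which is one way to see why self-consistency and general accuracy are separate properties.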