Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, driving their increasing adoption in software development. Despite the success of code LLMs, significant concerns remain about the actual capabilities and reliability of these models: "do these models really learn the semantics of code from the training data and leverage the learned knowledge to perform SE tasks?" In this paper, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, EMPICA systematically introduces controlled modifications/transformations into the input code and examines the models' responses. In general, code LLMs should be robust to semantically equivalent code inputs and sensitive to non-equivalent ones across all SE tasks. That is, for every SE task, given an input code snippet c and its semantically equivalent variants, code LLMs should robustly produce consistent/equivalent outputs, while they are expected to generate different outputs for c and its semantically non-equivalent variants. Our experimental results on three representative code understanding tasks, namely code summarization, method name prediction, and output prediction, reveal that the robustness and sensitivity of state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, code LLMs exhibit better robustness to semantic-preserving transformations than sensitivity to semantic-non-preserving ones. These results highlight the need to enhance the models' capabilities in understanding code semantics, especially the sensitivity property.
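To make the robustness/sensitivity distinction concrete, the following is a minimal illustrative sketch (not EMPICA's actual implementation, whose transformation operators are described in the paper): given a snippet c, a semantic-preserving transformation (consistent variable renaming) yields a variant on which a code LLM should produce an equivalent output, while a semantic-non-preserving transformation (here, flipping `+` to `-`) yields a variant on which the model should produce a different output. All function names below are hypothetical.

```python
import ast


def rename_variable(code: str, old: str, new: str) -> str:
    """Semantic-preserving: consistently rename a local variable."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == old:
            node.id = new
    return ast.unparse(tree)


def flip_add_to_sub(code: str) -> str:
    """Semantic-non-preserving: replace every `+` with `-`."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            node.op = ast.Sub()
    return ast.unparse(tree)


# An input snippet c and its two variants.
c = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s = s + x\n"
    "    return s"
)
c_equiv = rename_variable(c, "s", "acc")  # same behavior as c
c_nonequiv = flip_add_to_sub(c)           # different behavior from c

# Behavioral check of the ground truth: the equivalent variant computes
# the same result as c, the non-equivalent variant does not. For an LLM,
# the analogous expectation is: equivalent outputs on (c, c_equiv),
# different outputs on (c, c_nonequiv).
scope, scope_eq, scope_ne = {}, {}, {}
exec(c, scope)
exec(c_equiv, scope_eq)
exec(c_nonequiv, scope_ne)
print(scope["total"]([1, 2, 3]))     # 6
print(scope_eq["total"]([1, 2, 3]))  # 6
print(scope_ne["total"]([1, 2, 3]))  # -6
```

Variable renaming leaves the program's input-output behavior untouched, so a model that has learned code semantics should summarize, name, or execute both versions identically; the operator flip changes the computed function, so identical model outputs on that pair would indicate insensitivity to semantics.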