Large language models~(LLMs) demonstrate significant potential to revolutionize software engineering~(SE) through their outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk-control requirements of software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities and limitations of LLMs for code analysis in SE. We break down the abilities needed for artificial intelligence~(AI) models to address code-analysis-related SE tasks into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on the ability of LLMs to comprehend code syntax and semantic structures, including abstract syntax trees~(ASTs), control flow graphs~(CFGs), and call graphs~(CGs). We employed four state-of-the-art foundation models, GPT-4, GPT-3.5, StarCoder, and CodeLlama-13b-instruct, and assessed their performance on cross-language tasks involving C, Java, Python, and Solidity. Our findings reveal that while LLMs excel at understanding code syntax, they struggle to comprehend code semantics, particularly dynamic semantics. We conclude that LLMs possess capabilities similar to an AST parser, demonstrating initial competence in static code analysis. Furthermore, our study highlights that LLMs are susceptible to hallucinations when interpreting code semantic structures and may fabricate nonexistent facts. These results indicate the need for methods to verify the correctness of LLM outputs to ensure their dependability in SE. More importantly, our study provides an initial answer to why the code generated by LLMs is usually syntactically correct yet vulnerable.