Large language models (LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities and limitations of LLMs for code analysis in SE. We break down the abilities that artificial intelligence (AI) models need to address SE tasks related to code analysis into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on the ability of LLMs to comprehend code syntax and semantic structures, including abstract syntax trees (AST), control flow graphs (CFG), and call graphs (CG). We employed four state-of-the-art foundation models: GPT4, GPT3.5, StarCoder, and CodeLlama-13b-instruct. We assessed their performance on cross-language tasks involving C, Java, Python, and Solidity. Our findings reveal that while LLMs are adept at understanding code syntax, they struggle to comprehend code semantics, particularly dynamic semantics. We conclude that LLMs possess capabilities similar to those of an AST parser, demonstrating initial competencies in static code analysis. Furthermore, our study highlights that LLMs are susceptible to hallucination when interpreting code semantic structures and can fabricate nonexistent facts. These results indicate the need to explore methods for verifying the correctness of LLM output to ensure its dependability in SE. More importantly, our study provides an initial answer to why code generated by LLMs is usually syntactically correct but vulnerable.
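To make the structures named in the abstract concrete, the following is a minimal sketch of what a conventional tool extracts for two of them: the set of AST node types (syntax) and a direct call graph (static behavior). It is an illustration only, not the evaluation method used in the study. It assumes Python's standard `ast` module, and the sample program `SOURCE` and the helper names `node_types` and `call_graph` are hypothetical.

```python
import ast

# Hypothetical sample program used only for illustration.
SOURCE = '''
def helper(x):
    return x + 1

def main(values):
    total = 0
    for v in values:
        total += helper(v)
    print(total)
'''


def node_types(source):
    """Return the sorted set of AST node type names found in the source."""
    tree = ast.parse(source)
    return sorted({type(node).__name__ for node in ast.walk(tree)})


def call_graph(source):
    """Map each top-level function to the names it calls directly.

    Only calls of the form `name(...)` are recorded; method calls and
    calls through variables are ignored in this sketch.
    """
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
            graph[node.name] = callees
    return graph


if __name__ == "__main__":
    print(node_types(SOURCE))
    print(call_graph(SOURCE))
```

Both outputs are derived from the source text alone. Dynamic behavior, such as the value `main` prints for a given input, cannot be read off these structures without executing or simulating the program, which is the category the abstract reports as hardest for LLMs.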