ChatGPT demonstrates immense potential to transform software engineering (SE) by exhibiting outstanding performance in tasks such as code and document generation. However, the high reliability and risk control requirements of SE make ChatGPT's lack of interpretability a concern. To address this issue, we carried out a study evaluating ChatGPT's capabilities and limitations in SE. We broke down the abilities needed for AI models to tackle SE tasks into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on ChatGPT's ability to comprehend code syntax and semantic structures, including abstract syntax trees (ASTs), control flow graphs (CFGs), and call graphs (CGs). We assessed ChatGPT's performance on cross-language tasks involving C, Java, Python, and Solidity. Our findings revealed that while ChatGPT excels at understanding code syntax (AST), it struggles with comprehending code semantics, particularly dynamic semantics. We conclude that ChatGPT possesses capabilities akin to an AST parser, demonstrating initial competencies in static code analysis. Additionally, our study highlights that ChatGPT is susceptible to hallucination when interpreting code semantic structures, fabricating non-existent facts. These results underscore the need to explore methods for verifying the correctness of ChatGPT's outputs to ensure its dependability in SE. More importantly, our study provides an initial answer to why code generated by large language models (LLMs) is usually syntactically correct but vulnerable.
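To make the distinction between syntax structures (AST) and static structures (CG) concrete, the following is a minimal illustrative sketch and not part of the study's tooling. It uses Python's standard `ast` module; the snippet `SOURCE` and the helpers `node_type_counts` and `call_graph` are hypothetical names introduced here. The call graph is a naive name-based approximation that only records direct calls to plain identifiers, so it should be read as an illustration of the kind of ground truth such structures provide rather than as a full static analysis.

```python
import ast

# Hypothetical snippet used only for illustration.
SOURCE = """
def helper(x):
    return x + 1

def main(n):
    if n > 0:
        return helper(n)
    return helper(-n)
"""


def node_type_counts(tree):
    """Count AST node types (a purely syntactic view of the code)."""
    counts = {}
    for node in ast.walk(tree):
        name = type(node).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


def call_graph(tree):
    """Map each function to the names it calls directly (naive static view).

    Assumption: only calls of the form `name(...)` are recorded; method
    calls, aliases, and dynamic dispatch are ignored.
    """
    graph = {}
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            callees = set()
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callees.add(node.func.id)
            graph[func.name] = sorted(callees)
    return graph


tree = ast.parse(SOURCE)
counts = node_type_counts(tree)
graph = call_graph(tree)

print(counts["FunctionDef"], counts["If"], counts["Return"])  # 2 1 3
print(graph)  # {'helper': [], 'main': ['helper']}
```

Output produced by a parser in this way could serve as a reference against which a model's description of the same structures is checked, which is one possible form of the output verification the abstract calls for.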