This paper investigates the static analysis capability of code LLMs during code intelligence tasks such as code summarization and generation. Code LLMs are now household names for their ability to perform programming tasks that have heretofore required people. The process that people follow to do programming tasks has long been understood to require static analysis. For example, human programmers navigate the call graph of large programs to comprehend the different parts of those programs. Education in programming includes static analysis under the assumption that better static analysis skills beget better programming. While popular culture is replete with anthropomorphic references such as LLM ``reasoning'', code LLMs could in fact exhibit a thought process wholly alien to that of humans. This paper studies the specific question of static analysis by code LLMs. Our experiments use three static analysis tasks (call graph generation, AST generation, and dataflow generation) and three code intelligence tasks (code generation, summarization, and translation) with two closed-source models (Gemini and GPT-4o) and two open-source models (CodeLlaMA and Jam). We found that LLMs show poor performance on static analysis tasks and that pretraining on the static analysis tasks does not generalize to better performance on the code intelligence tasks, and vice versa.