The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA. Our study reveals that LLMs show promising results in type inference, achieving higher accuracy than traditional methods, yet they exhibit limitations in callgraph analysis. This contrast highlights the need for specialized fine-tuning to adapt LLMs to specific static analysis tasks. Our findings provide a foundation for further research on integrating LLMs into static analysis.