Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs. We empirically evaluate 24 LLMs, including OpenAI's GPT series and open-source models such as LLaMA and Mistral, using existing and newly developed benchmarks. Specifically, we enhance TypeEvalPy, a micro-benchmarking framework for type inference in Python, with auto-generation capabilities, expanding its scope from 860 to 77,268 type annotations. Additionally, we introduce SWARM-CG and SWARM-JS, comprehensive benchmarking suites for evaluating call-graph construction tools across multiple programming languages. Our findings reveal contrasting performance of LLMs across static analysis tasks. For call-graph generation in Python, traditional static analysis tools such as PyCG significantly outperform LLMs. In JavaScript, the static tool TAJS underperforms due to its inability to handle modern language features; LLMs, despite showing promise with models such as mistral-large-it-2407-123b and GPT-4o, struggle with completeness and soundness in call-graph analysis for both languages. Conversely, LLMs demonstrate a clear advantage in type inference for Python, surpassing traditional tools like HeaderGen and hybrid approaches such as HiTyper. These results suggest that while LLMs hold promise for type inference, their limitations in call-graph analysis highlight the need for further research. Our study provides a foundation for integrating LLMs into static analysis workflows, offering insights into both their strengths and their current limitations.