Large language models (LLMs) have been widely adopted across diverse software engineering tasks, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and the interdependence of program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' capacity for semantic program reasoning underexplored. This work presents CORE, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CORE comprises 12,553 task instances spanning data dependency, control dependency, and information flow in programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 mainstream LLMs and show that, while they perform well at identifying dependencies, they still struggle with tasks that require deeper semantic understanding and multi-step reasoning. We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs' code reasoning capabilities.