Large language models (LLMs) are increasingly deployed to understand large codebases, but whether they grasp the operational semantics of code in long contexts or rely on pattern-matching shortcuts remains unclear. We distinguish lexical recall (retrieving code verbatim) from semantic recall (understanding its operational semantics). Evaluating 10 state-of-the-art LLMs, we find that while frontier models achieve near-perfect, position-independent lexical recall, semantic recall degrades severely when the relevant code is positioned in the middle of a long context. We introduce semantic recall sensitivity, which measures whether a task requires understanding a snippet's operational semantics or permits pattern-matching shortcuts. Through a novel counterfactual measurement method, we show that models rely heavily on such shortcuts to solve existing code understanding benchmarks. We then propose SemTrace, a new task that achieves high semantic recall sensitivity through unpredictable operations. On SemTrace, LLM accuracy exhibits severe positional effects, with a median drop of 92.73% (versus 53.36% on CRUXEval) as the relevant code snippet approaches the middle of the input context. Our findings suggest that current evaluations substantially underestimate semantic recall failures in long-context code understanding.