Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a model's ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from semantic program reasoning. Meanwhile, traditional FL benchmarks such as Defects4J and BugsInPy are either not scalable or obsolete, as their datasets have become part of LLM training data, leading to biased results. This paper presents the first large-scale empirical investigation into the robustness of LLMs' fault localizability. Inspired by mutation testing, we develop an end-to-end evaluation framework that addresses key limitations in existing LLM evaluation, including data contamination, scalability, automation, and extensibility. Using real-world programs with specifications, we inject unseen faults and ask LLMs to localize them, filtering out underspecified programs where localization is ambiguous. For each successfully localized program, we apply semantic-preserving mutations (SPMs) and rerun localization to assess robustness and determine whether LLM reasoning relies on syntactic cues rather than semantics. We evaluate 10 state-of-the-art LLMs on 750,013 fault localization tasks from over 1,300 Java and Python programs. We find that SPMs cause LLMs to fail on previously localized faults in 78% of cases, and that reasoning is stronger when relevant code appears earlier in context. These results indicate that LLM code reasoning is often tied to features irrelevant to semantics. We also identify code patterns that are challenging for LLMs to reason about. Overall, our findings motivate fundamental advances in how LLMs represent, interpret, and prioritize code semantics to reason more deeply about program logic.
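To make the idea of a semantic-preserving mutation concrete, the following is a minimal Python sketch, not the framework's actual implementation. The abstract does not list the specific SPMs used, so consistent identifier renaming is assumed here as one plausible example. The function `count_positive`, the injected fault, and the helper names `RenameIdentifiers`, `apply_rename`, and `load_function` are all hypothetical and chosen for illustration only.

```python
import ast

# Hypothetical program with an injected fault: the (assumed) specification
# says "count strictly positive numbers", but the condition uses >=.
FAULTY_SOURCE = """
def count_positive(numbers):
    total = 0
    for item in numbers:
        if item >= 0:
            total += 1
    return total
"""


class RenameIdentifiers(ast.NodeTransformer):
    """Consistently rename variables and parameters according to a mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def apply_rename(source, mapping):
    """Return source code with identifiers renamed (requires Python 3.9+)."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def load_function(source, name):
    """Execute source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


# Assumed mutation: replace meaningful names with uninformative ones.
MAPPING = {"numbers": "v0", "total": "v1", "item": "v2"}
MUTATED_SOURCE = apply_rename(FAULTY_SOURCE, MAPPING)

original_fn = load_function(FAULTY_SOURCE, "count_positive")
mutated_fn = load_function(MUTATED_SOURCE, "count_positive")

# The mutation changes surface syntax only: both versions behave
# identically, so the injected fault is still present on the same statement.
SAMPLE_INPUTS = [[], [0], [1, -2, 3], [0, 0, -1]]
BEHAVIOR_PRESERVED = all(
    original_fn(xs) == mutated_fn(xs) for xs in SAMPLE_INPUTS
)

if __name__ == "__main__":
    print(MUTATED_SOURCE)
    print("behavior preserved:", BEHAVIOR_PRESERVED)
```

In a robustness check along the lines described above, a model that localized the fault in `FAULTY_SOURCE` would be asked again on `MUTATED_SOURCE`. Because the two programs are behaviorally equivalent, a change in the model's answer would suggest reliance on lexical cues such as identifier names rather than on program semantics.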