Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31\% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k-128k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7\% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.
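The context-inflation pipeline described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function names, the `### File:` delimiter, and the whitespace-based token count are all assumptions (a real pipeline would use the target model's tokenizer). The key property it demonstrates is that every gold-patch file is included unconditionally, guaranteeing perfect retrieval recall, while distractor files pad the prompt up to a target budget such as 64k or 128k tokens.

```python
def count_tokens(text):
    # Crude whitespace approximation; a real pipeline would use the
    # model's own tokenizer (e.g. tiktoken for GPT-family models).
    return len(text.split())

def build_inflated_context(gold_files, distractor_files, budget_tokens):
    """Assemble a long-context prompt for single-shot patch generation.

    gold_files / distractor_files: dicts mapping file path -> source text.
    Returns the concatenated context and its approximate token count.
    """
    parts, used = [], 0
    # Gold files (those touched by the reference patch) go in
    # unconditionally, so recall of relevant code is 1.0 by construction.
    for path, src in gold_files.items():
        block = f"### File: {path}\n{src}\n"
        parts.append(block)
        used += count_tokens(block)
    # Pad with distractor files from the repository until the target
    # budget (e.g. 64k or 128k tokens) is reached.
    for path, src in distractor_files.items():
        block = f"### File: {path}\n{src}\n"
        cost = count_tokens(block)
        if used + cost > budget_tokens:
            break
        parts.append(block)
        used += cost
    return "\n".join(parts), used
```

Under this construction, any failure at long context cannot be blamed on retrieval: the relevant code is always present, so degraded resolve rates isolate the model's long-context reasoning.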