Rapidly increasing context lengths have fostered the assumption that large language models (LLMs) can reason directly over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance is strong: GPT-5-nano achieves up to a 31\% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k-30k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success arises primarily from decomposing the task into short-context steps rather than from effective long-context reasoning. To test long-context capability directly, we construct a data pipeline that artificially inflates the input context by placing the relevant files into the prompt (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k tokens). In this setting, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7\% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.