LLM-based coding agents perform strongly on automated issue-resolution benchmarks, yet existing evaluations focus largely on final task success and offer limited insight into how agents retrieve and use code context while solving a problem. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval ("The Bitter Lesson" of coding agents), that LLMs consistently favor recall over precision, and that substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that open up the issue-resolution process. These gold contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.
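To make the context recall and precision metrics concrete, the following is a minimal sketch, not the ContextBench implementation. It assumes that both the gold context and the context an agent retrieved can be represented as sets of (file path, line number) pairs; this granularity, the function name `context_recall_precision`, and the example paths are illustrative assumptions. The efficiency metric is not sketched because the abstract does not define it.

```python
from typing import Set, Tuple

# Hypothetical unit of context: one source line, identified by file and line number.
Span = Tuple[str, int]


def context_recall_precision(retrieved: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved context against gold context.

    recall    = |retrieved ∩ gold| / |gold|
    precision = |retrieved ∩ gold| / |retrieved|
    An empty denominator yields 0.0 (an assumption made for this sketch).
    """
    overlap = len(retrieved & gold)
    recall = overlap / len(gold) if gold else 0.0
    precision = overlap / len(retrieved) if retrieved else 0.0
    return recall, precision


# Illustrative data only: four gold lines, six retrieved lines, three in common.
gold = {
    ("src/parser.py", 10),
    ("src/parser.py", 11),
    ("src/lexer.py", 42),
    ("src/lexer.py", 43),
}
retrieved = {
    ("src/parser.py", 10),
    ("src/parser.py", 11),
    ("src/lexer.py", 42),
    ("src/utils.py", 5),
    ("src/utils.py", 6),
    ("src/utils.py", 7),
}

recall, precision = context_recall_precision(retrieved, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.50
```

In this toy case the agent reads more than it needs: recall (0.75) exceeds precision (0.50), which is the pattern the abstract reports when it says LLMs favor recall over precision.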