Large language models are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness on realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, agents' ability to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience-reuse settings, including oracle-guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience yields limited or even negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.