Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below a 45% success rate, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding capabilities. Our experiments show that search augmentation yields inconsistent gains and can, in some cases, degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.