BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

Source: arXiv. Benchmark: https://huggingface.co/datasets/AweAI-Team/BeyondSWE. Repo: https://github.com/AweAI-Team/BeyondSWE. Scaffold: https://github.com/AweAI-Team/AweAgent

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, so many software engineering tasks that require external knowledge or broader repository-level changes remain under-tested. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories, to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation. Together these span both a broader knowledge scope and a broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches an average score of 46.12, while the strongest Codex harness, with GPT-5.4 (xhigh), reaches 56.65 under a search-aware prompt. To study whether access to external information closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps on some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.
