Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be readily extended to other existing benchmarks. In this paper, we apply our framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, transforming formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities, by more than 50% relative to baseline performance for some models on the public benchmarks and by roughly 10-16% on our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
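To make the benchmark-mutation idea concrete, the sketch below illustrates one plausible shape such a transformation could take: a formal benchmark instance is rewritten into a terse, user-style chat query by a text-to-text model. This is a hypothetical illustration only, not the paper's implementation; the dataclass fields, the `rewrite_issue`/`mutate` names, and the prompt wording are all assumptions introduced here for exposition.

```python
"""Illustrative sketch (not the paper's pipeline) of benchmark mutation:
rewriting a formal GitHub-issue description into a short, informal query
of the kind developers send to in-IDE chat agents."""

from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkInstance:
    instance_id: str
    repo: str
    issue_text: str            # original formal issue description (e.g., from SWE-Bench Verified)
    user_style_query: str = "" # mutated query meant to mimic real chat-agent telemetry


# Hypothetical prompt: condenses the issue the way users tend to phrase requests
# (short, informal, often omitting reproduction steps and environment details).
MUTATION_PROMPT = (
    "Rewrite the following GitHub issue as a short, informal request a developer "
    "might type to an in-IDE coding assistant. Drop stack traces and version "
    "details unless they are essential.\n\nIssue:\n{issue}"
)


def mutate(instance: BenchmarkInstance,
           rewrite_fn: Callable[[str], str]) -> BenchmarkInstance:
    """Produce a user-style variant of a benchmark instance.

    `rewrite_fn` is any text-to-text model call (e.g., an LLM completion);
    it is deliberately left abstract, since the abstract does not prescribe one.
    """
    query = rewrite_fn(MUTATION_PROMPT.format(issue=instance.issue_text))
    return BenchmarkInstance(
        instance_id=instance.instance_id,
        repo=instance.repo,
        issue_text=instance.issue_text,
        user_style_query=query.strip(),
    )
```

Under this framing, the original and mutated instances share the same repository snapshot and test oracle, so any drop in resolution rate can be attributed to the change in query realism rather than to the task itself.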