Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be readily extended to existing benchmarks. In this paper, we apply our framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, transforming formal GitHub issue descriptions into realistic user-style queries informed by telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities: for some models, by more than 50% relative to baseline performance on the public benchmarks and by roughly 10-16% on our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.