AI coding agents have rapidly transformed software engineering and now power widely used interactive coding assistants. Although these agents are used interactively in practice, existing benchmarks evaluate them as fully autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment it with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models are not always better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.