IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem representative of AI-native IDEs such as Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and full-stack application testing, IDE-Bench evaluates an agent's ability to act as a genuine engineering collaborator. To prevent training-data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks. These tasks reflect modern production scenarios, including feature implementation, bug fixing, refactoring, and performance optimization, and mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.
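To make the tool interface concrete, the following is a minimal, hypothetical sketch of the kind of structured tool calls such a harness might expose in place of raw shell access. The tool names (`codebase_search`, `edit_file`, `run_tests`) and their argument shapes are illustrative assumptions for the three abstraction layers named above, not IDE-Bench's actual API.

```python
# A minimal sketch of an IDE-native tool-call interface (assumed, not the
# benchmark's real schema): the agent emits structured actions instead of
# free-form terminal commands, so the harness can validate and log them.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """One structured action an agent may issue to the harness."""
    name: str                          # e.g. "codebase_search", "edit_file", "run_tests"
    args: dict = field(default_factory=dict)

# Illustrative calls mirroring the three abstractions named in the abstract.
search = ToolCall("codebase_search",
                  {"query": "user authentication middleware"})
edit = ToolCall("edit_file",
                {"path": "src/auth.js",
                 "old_text": "verifyToken(req)",
                 "new_text": "verifyToken(req, {audience: AUD})"})
test = ToolCall("run_tests",
                {"suite": "integration", "service": "api"})

for call in (search, edit, test):
    print(call.name, call.args)
```

A schema along these lines would also explain how agent-reported intent can be correlated with project-level modifications: each structured call carries enough metadata for the harness to match stated goals against the edits and test outcomes it observes.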