IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem that mirrors AI-native IDEs such as Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and full-stack application testing, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. To prevent training-data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing production scenarios in modern tech stacks: feature implementation, bug fixing, refactoring, and performance optimization tasks that reflect daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code. We release IDE-Bench and a public leaderboard at: https://ide-bench.com.
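To make the "structured tool ecosystem" concrete, the following is a minimal sketch of how such an IDE-native harness might expose tools to an agent. All names (`ToolRegistry`, `codebase_search`, `edit_file`) are illustrative assumptions, not the actual IDE-Bench API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical registry mapping tool names to handlers, mimicking the kind of
# structured interface (search / edit / test) a harness could expose to an agent.
@dataclass
class ToolRegistry:
    handlers: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self.handlers[name] = fn

    def dispatch(self, name: str, **kwargs) -> str:
        # Agents call tools by name with structured arguments; unknown tool
        # names produce a structured error rather than a raw shell failure.
        if name not in self.handlers:
            return f"error: unknown tool '{name}'"
        return self.handlers[name](**kwargs)

registry = ToolRegistry()
# Stub handlers; a real harness would search the repo, apply an edit, run tests.
registry.register("codebase_search", lambda query: f"matches for '{query}'")
registry.register("edit_file", lambda path, diff: f"applied edit to {path}")

print(registry.dispatch("edit_file", path="server/app.js", diff="..."))
```

Compared with raw terminal access, this kind of dispatch layer lets the harness log every agent-reported intent alongside the tool call that realizes it, which is what makes the intent-to-modification correlation in the paper measurable.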