AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not sufficiently difficult to meaningfully evaluate frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated, hard benchmark composed of 89 tasks in computer terminal environments, inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65\% on the benchmark, and we conduct an error analysis to identify areas for model and agent improvement. To assist developers and researchers in future work, we publish the dataset and evaluation harness at https://www.tbench.ai/.