OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LES-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
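
The three fault categories named in the abstract can be illustrated with a small sketch. This is not the OccuBench implementation: the function name `inject_fault`, the response layout (`status`, `data`), and the example fields are all assumptions made for illustration. It only shows why an implicit fault is harder to notice than an explicit one: the degraded response still reports success.

```python
import copy
import random


def inject_fault(response, mode, rng):
    """Return a possibly degraded copy of a tool response (hypothetical layout).

    mode: "none", "explicit", "implicit", or "mixed".
    """
    if mode == "mixed":
        # Mixed faults: pick one of the two fault kinds per call.
        mode = rng.choice(["explicit", "implicit"])
    if mode == "explicit":
        # Overt failure: the agent receives a clear error signal.
        return {"status": 500, "error": "Internal Server Error"}
    if mode == "implicit":
        # Silent degradation: status still looks fine, but one field is gone.
        degraded = copy.deepcopy(response)
        data = degraded.get("data", {})
        if data:
            del data[rng.choice(sorted(data))]
        return degraded
    return response  # "none": pass the response through unchanged


# Hypothetical tool response from a triage-style environment.
clean = {
    "status": 200,
    "data": {"patient_id": "P-1", "heart_rate": 88, "triage_level": 2},
}
rng = random.Random(0)
explicit = inject_fault(clean, "explicit", rng)
implicit = inject_fault(clean, "implicit", rng)
```

In this sketch the explicit response carries a 500 status the agent can react to, while the implicit response keeps status 200 and simply lacks a field, so the agent has to check the payload itself to detect the problem.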
