AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can evaluate agents only in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains, and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to detect data degradation on its own; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators. Simulator quality is therefore critical to the reliability of LES-based evaluation. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.
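The three fault-injection modes named in the abstract (explicit errors, implicit data degradation, and mixed faults) can be illustrated with a minimal sketch. This is not the OccuBench implementation: the response shape (`status`, `data`), the function names, and the 50/50 split used for mixed faults are all assumptions made for illustration only.

```python
import random
from typing import Any, Dict

# Hypothetical tool-response shape: {"status": int, "data": {...}}.
ToolResponse = Dict[str, Any]


def explicit_fault(_response: ToolResponse) -> ToolResponse:
    """Replace the response with an overt error the agent can see."""
    return {"status": 500, "error": "Internal Server Error"}


def implicit_fault(response: ToolResponse, rng: random.Random) -> ToolResponse:
    """Degrade the payload silently: drop one field and truncate lists.

    The status code is left untouched, so nothing signals the fault.
    """
    degraded = dict(response)
    data = dict(degraded.get("data", {}))
    if data:
        # Missing field: remove one key chosen at random.
        del data[rng.choice(sorted(data))]
    for key, value in data.items():
        if isinstance(value, list):
            # Truncated data: keep only the first half of each list.
            data[key] = value[: len(value) // 2]
    degraded["data"] = data
    return degraded


def inject_fault(response: ToolResponse, mode: str, rng: random.Random) -> ToolResponse:
    """Apply one fault mode to a simulated tool response."""
    if mode == "none":
        return response
    if mode == "explicit":
        return explicit_fault(response)
    if mode == "implicit":
        return implicit_fault(response, rng)
    if mode == "mixed":
        # Assumed policy: pick either fault type with equal probability.
        if rng.random() < 0.5:
            return explicit_fault(response)
        return implicit_fault(response, rng)
    raise ValueError(f"unknown fault mode: {mode}")


# Example: an invented triage-tool response passed through each mode.
clean = {
    "status": 200,
    "data": {"patient_id": "P-1", "vitals": [1, 2, 3, 4], "triage_level": 2},
}
rng = random.Random(0)
explicit_out = inject_fault(clean, "explicit", rng)
implicit_out = inject_fault(clean, "implicit", rng)
```

In this sketch the implicit case still reports status 200 while returning fewer fields, which mirrors the abstract's point that such faults carry no overt error signal and must be detected by the agent itself.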