LLM-based agents struggle to execute the complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool-orchestration demands of real-world workflows. We introduce SOP-Bench, a benchmark of 2,000+ tasks derived from human expert-authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human-AI collaborative framework, experts crafted authentic SOPs while AI generated supporting artifacts (tools, APIs, datasets), all of which were human-validated, yielding realistic tasks with executable interfaces and ground-truth outputs. SOP-Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. We demonstrate its utility through illustrative experiments with a subset of frontier models across Function-Calling (FC) and ReAct agents, revealing critical insights. For example: (1) newer models do not guarantee better performance: the Claude 4 family outperforms the Claude 4.5 family on ReAct tasks (Claude 4 Opus: 72.4% vs. Claude 4.5 Sonnet: 63.3% task success rate), demonstrating that production upgrades require validation; (2) no single model-agent combination dominates: best performance ranges from 57% to 100% depending on the domain. These examples illustrate how SOP-Bench enables isolating and studying specific dimensions of agent performance without costly production experiments. Our goal is not to rank model capabilities or build optimal agents, but to provide a rigorous evaluation framework that enables researchers and practitioners to systematically investigate agent design choices, model selection, and deployment strategies. We release the benchmark at https://github.com/amazon-science/sop-bench.