We present Agent-Diff, a benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies with the model, external tool access, prompt structure, and agentic framework. Benchmarks must therefore trade off between sandboxed approaches, which control for variation in software environments, and more ecologically valid approaches that employ real services. Agent-Diff captures the desirable features of both: it provides access to the real API interfaces of software services while sandboxing the environment in which calls are made, processed, and evaluated. This design rests on two key innovations. The first is a novel state-diff contract that separates process from outcome: rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox providing a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). We can thus evaluate different agentic LLMs against a standardized set of contracts in a unified sandbox while still measuring their performance on real-world service interfaces. Using the Agent-Diff framework, we benchmark nine LLMs across 224 tasks drawn from enterprise software workflows. We also assess the robustness of the framework with ablation experiments that measure the contribution of API documentation access to benchmark performance. Code and data: https://github.com/agent-diff-bench/agent-diff.
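To make the state-diff contract concrete, the following is a minimal sketch of an outcome-based success check: snapshot the environment state before and after the agent runs, compute the diff, and declare success iff the observed change matches the expected one. All names here (`state_diff`, `task_succeeded`, the flat key scheme) are illustrative assumptions, not the actual Agent-Diff API.

```python
def state_diff(before: dict, after: dict) -> dict:
    """Return every key whose value changed, was added, or was removed,
    along with its before/after values."""
    diff = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            diff[key] = {"before": before.get(key), "after": after.get(key)}
    return diff


def task_succeeded(before: dict, after: dict, expected_diff: dict) -> bool:
    """Outcome-based check: success iff the observed state change equals
    the expected change, regardless of which API calls produced it."""
    return state_diff(before, after) == expected_diff


# Hypothetical task: rename a Linear issue and change nothing else.
before = {"issue-42.title": "Fix login bug", "issue-42.status": "open"}
after = {"issue-42.title": "Fix login flow bug", "issue-42.status": "open"}
expected = {
    "issue-42.title": {"before": "Fix login bug", "after": "Fix login flow bug"}
}
print(task_succeeded(before, after, expected))
```

Because the check compares only the net state change, any sequence of API calls that produces the expected environment state counts as success, and any extra side effect (an unexpected key in the diff) counts as failure.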