We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution. Agentic LLM performance varies with the model, external tool access, prompt structure, and agentic framework. Benchmarks therefore face a fundamental trade-off between sandboxed approaches, which control for variation in software environments, and more ecologically valid approaches, which employ real services. Agent-Diff aims to capture the desirable features of both by exposing the real API interfaces of software services while sandboxing the environment in which calls are made, processed, and evaluated. The approach relies on two key innovations. The first is a state-diff contract that separates process from outcome: rather than relying on fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a sandbox built on containerized replicas of enterprise APIs, which lets all models interact with the same service interfaces through code execution. Together, these enable controlled evaluation against a common set of state-diff contracts while preserving the structure of real-world API interaction. Using the Agent-Diff framework, we benchmark nine LLMs on 224 tasks spanning enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments that assess how access to API documentation contributes to benchmark performance. Code and data: https://github.com/agent-diff-bench/agent-diff.
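
To make the idea of defining success by the change in environment state concrete, the following is a minimal sketch of how such a check could look. It is not Agent-Diff's actual implementation: the names `compute_diff` and `satisfies_contract`, the flat `entity id -> record` state layout, and the strict-equality rule are all assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical state layout: entity id -> record of field values.
State = Dict[str, Dict[str, Any]]
Diff = Dict[str, Dict[str, Any]]


def compute_diff(before: State, after: State) -> Diff:
    """Return the inserts, deletes, and field-level updates between two snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed: Dict[str, Any] = {}
    for k in before.keys() & after.keys():
        # Record only the fields whose value differs, with their new value.
        fields = {
            f: after[k].get(f)
            for f in before[k].keys() | after[k].keys()
            if before[k].get(f) != after[k].get(f)
        }
        if fields:
            changed[k] = fields
    return {"added": added, "removed": removed, "changed": changed}


def satisfies_contract(before: State, after: State, expected: Diff) -> bool:
    """Assumed rule: the task succeeds only if the observed diff equals the expected diff."""
    return compute_diff(before, after) == expected


# Illustrative data: an issue tracker before and after an agent run.
BEFORE: State = {"issue-1": {"title": "Fix login", "status": "todo"}}
AFTER: State = {
    "issue-1": {"title": "Fix login", "status": "done"},
    "issue-2": {"title": "Write tests", "status": "todo"},
}
EXPECTED: Diff = {
    "added": {"issue-2": {"title": "Write tests", "status": "todo"}},
    "removed": {},
    "changed": {"issue-1": {"status": "done"}},
}

if __name__ == "__main__":
    print(satisfies_contract(BEFORE, AFTER, EXPECTED))
```

Under this sketch, the sequence of API calls an agent made never enters the check; any two runs that end in the same state receive the same verdict. Strict equality also means an unintended side effect (an extra created or deleted record) fails the check, which is one possible design choice rather than a documented property of the framework.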