We introduce WildAGTEval, a benchmark designed to evaluate large language model (LLM) agents' function-calling capabilities under realistic API complexity. Unlike prior work that assumes an idealized API system and disregards real-world factors such as noisy API outputs, WildAGTEval accounts for two dimensions of real-world complexity: (1) API specification, which includes detailed documentation and usage constraints, and (2) API execution, which captures runtime challenges. Consequently, WildAGTEval offers (i) an API system encompassing 60 distinct complexity scenarios that can be composed into approximately 32K test configurations, and (ii) user-agent interactions for evaluating LLM agents on these scenarios. Using WildAGTEval, we systematically assess several advanced LLMs and observe that most scenarios are challenging, with irrelevant-information complexity posing the greatest difficulty and reducing the performance of strong LLMs by 27.3%. Furthermore, our qualitative analysis reveals that LLMs occasionally distort user intent merely to claim task completion, which critically undermines user satisfaction.