While Large Language Model (LLM) agents have achieved remarkable progress on complex reasoning tasks, evaluating their performance in real-world environments remains a critical problem. Current benchmarks, however, are largely restricted to idealized simulations and fail to address the practical demands of specialized domains such as advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark grounded in the real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing-analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but its performance drops markedly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench thus provides a realistic benchmark for evaluating and improving advertising and marketing agents; the leaderboard and code are available at https://github.com/Emanual20/adbench-leaderboard.
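The abstract reports Pass@1 and Pass@3 but does not state how they are computed. A common convention for Pass@k is the unbiased estimator of Chen et al. (2021), which estimates the probability that at least one of k samples drawn from n independent attempts is correct. The sketch below is an assumption about the scoring, not the paper's confirmed procedure; the `results` data is purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, of which c
    are correct, is correct. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect attempts: any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task records: (attempts made, attempts judged correct).
results = [(5, 3), (5, 0), (5, 5), (5, 1)]

# Benchmark-level Pass@k is the mean of the per-task estimates.
pass1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
pass3 = sum(pass_at_k(n, c, 3) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct attempts per task, so Pass@1 is simply the average single-attempt success rate, while Pass@3 rewards models that solve a task in at least one of three tries.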