While Large Language Model (LLM) agents have made remarkable progress on complex reasoning, evaluating them in real-world environments remains an open problem. Existing benchmarks are largely confined to idealized simulations and fail to capture specialized domains such as advertising and marketing analytics, where tasks require multi-round interaction with professional tools and where ground-truth answers quickly become obsolete as data and platform rules evolve. To address this, we propose AD-Bench, a benchmark built from real user marketing-analysis requests on a production advertising platform. AD-Bench introduces two key designs: (i) a dynamic ground-truth pipeline that replays expert tool-call trajectories to regenerate answers consistent with the current environment, mitigating answer obsolescence; and (ii) a trajectory-aware evaluation that jointly measures end-to-end answer correctness (Pass@k) and trajectory coverage. Requests are stratified into three difficulty levels (L1-L3) to probe multi-round, multi-tool collaboration. Experiments show that the best model, Claude-Opus-4.7, attains Pass@1 = 76.9% and Pass@3 = 80.4% with 82.7% trajectory coverage overall, yet drops sharply on L3 to Pass@1 = 61.4% and Pass@3 = 65.1%, revealing that even state-of-the-art agents have substantial gaps in complex advertising analytics.
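The two metrics named above can be illustrated with a minimal sketch. The abstract does not give formulas, so both definitions here are assumptions: `pass_at_k` uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) over n sampled attempts with c correct, and `trajectory_coverage` is taken to be the fraction of distinct expert tool calls that also appear in the agent's trajectory. The function names and the tool names in the usage note are hypothetical, not taken from AD-Bench.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts of which c were correct.

    Assumed definition (not confirmed by the abstract): the standard
    unbiased estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect attempts: every k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def trajectory_coverage(expert_calls: Sequence[str],
                        agent_calls: Sequence[str]) -> float:
    """Fraction of distinct expert tool calls found in the agent trajectory.

    Assumed definition: set overlap on tool names, ignoring order and
    arguments. The benchmark's actual matching rule may be stricter.
    """
    expert = set(expert_calls)
    if not expert:
        return 1.0  # nothing to cover
    return len(expert & set(agent_calls)) / len(expert)
```

For example, with hypothetical tool names, `trajectory_coverage(["get_campaigns", "query_spend", "compare_periods", "summarize"], ["get_campaigns", "compare_periods", "export_csv"])` covers two of the four expert calls, and `pass_at_k(3, 1, 1)` averages one success over three attempts.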